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Preface 



Robot learning is an exciting and interdisciplinary field. This state is reflected 
in the range and form of the papers presented here. Techniques that have be- 
come well established in robot learning are present: evolutionary methods, neural 
network approaches, reinforcement learning; as are techniques from control the- 
ory, logic programming, and Bayesian statistics. It is notalbe that in many of 
the papers presented in this volume several of these techniques are employed 
in conjunction. In papers by Nehmzow, Grossmann and Quoy neural networks 
are utilised to provide landmark-based representations of the environment, but 
different techniques are used in each paper to make inferences based on these 
representations . 

Biology continues to provide inspiration for the robot learning researcher. 
In their paper Peter Eggenberger et al. borrow ideas about the role of neu- 
romodulators in switching neural circuits. These are combined with standard 
techniques from artificial neural networks and evolutionary computing to pro- 
vide a powerful new algorithm for evolving robot controllers. In the final paper 
in this volume Bianco and Cassinis combine observations about the navigation 
behaviour of insects with techniques from control theory to produce their visual 
landmark learning system. Hopefully this convergence of engineering and biolog- 
ical approaches will continue. A rigourous understanding of the ways techniques 
from these very different disciplines can be fused is an important challenge if 
progress is to continue. Al these papers are also testament to the utility of using 
robots to study intelligence and adaptive behaviour. Working with robots forces 
us to confront difficult computational problems that may otherwise be all too 
temptingly swept under the carpet. 

In this proceedings we present seven of the talks presented at the 8th Eu- 
ropean Workshop on Learning Robots in an update and expanded form. These 
are supplemented by two invited papers, by Ulrich Nehmzow on Map Building 
for Self-Organisation and by Axel Grossmann and Riccardo Poll on Learning a 
Navigation Task in Changing Environments by Multi-task Reinforcement Learn- 
ing. The workshop took place in the friendly surroundings of the Swiss Federal 
Institute of Technology, (EPFL), Lausanne, Switzerland, in conjunction with 
the European Conference on Artificial Life ’99. The workshop organisers would 
like to express their deep gratitude to Dario Floreano and all at EPFL who con- 
tributed to the success of the meeting. Finally, and certainly not least, we must 
applaud the programme reviewers who again this year, producing reviews of the 
highest standard. 
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Map Building through Self-Organisation 
for Robot Navigation 



Ulrich Nehmzow 



Department of Computer Science 
The University of Manchester 
Manchester M13 9PL 
United Kingdom 
ulrichScs .man. ac .uk 



Abstract. The ability to navigate is arguably the most fundamental 
competence of any mobile agent, besides the ability to avoid basic envi- 
ronmental hazards (e.g. obstacle avoidance). 

The simplest method to achieve navigation in mobile robot is to use 
path integration. However, because this method suffers from drift errors, 
it is not robust enough for navigation over middle scale and large scale 
distances. 

This paper gives an overview of research in mobile robot navigation at 
Manchester University, using mechanisms of self-organisation (artificial 
neural networks) to identify perceptual landmarks in the robot’s environ- 
ment, and to use such landmarks for route learning and self-localisation, 
as well as the quantitative assessment of the performance of such systems. 



1 Introduction 

This overview article describes the research in mobile robot navigation conducted 
at the University of Manchester. It addresses the question of mechanisms for 
robot map building, mobile robot self-localisation, and landmark selection. 

Experimental results are presented, and references to literature giving more 
details are provided. 



2 Motivation 

2.1 The Need for Navigation 

The ability to move cannot be exploited to its full potential, unless the agent — 
in this case a fully autonomous mobile robot (see figure H]) — is able to move to 
specific locations in its environment, that is: to navigate. 

Without the ability to navigate, the agent has to resort to strategies of ran- 
dom motion or tropisms, which are often inefficient. 
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Fig. 1. FortyTwo, the Nomad 200 robot used in the experiments. The robot is equipped 
with 16 sonar and 16 infrared sensors, flux gate compass and an odometry system. The 
camera visible in the photograph was not used in the experiments presented here. 



2.2 Fundamental Components of Navigation 

The ability to navigate entails the following four competences: 

1. Map building, 

2. self-localisation, 

3. map interpretation, and 

4. path planning. 

For the context of this paper, “map” stands for a one-to-one mapping between 
physical space (i.e. the world) and map space (i.e. the internal representation of 
that world). Such a map is not necessarily a metric representation — it can be 
topological as well. “Map building” refers to the construction of such a mapping, 
be it the construction of a metric “road map” , or some other representation such 
as an artificial neural network. “Self localisation” refers to the navigator’s ability 
to make the connection between the world and its representation, i.e. to establish 
his position in the map. Finally, “map interpretation” refers to the use of that 
mapping for navigational purposes such as homing, route following, or in general 
path planning of arbitrary paths. 

If any of these four competences is missing, full navigation (i.e. movement 
to an arbitrary goal) becomes impossible, and the navigator has to resort to 
either random movement (search behaviour), or tropism (moving towards an 
attractor) . 

2.3 Methods of Navigation 

Broadly speaking, there are two major options to achieve robot navigation. One 
is to use proprioceptive sensors such as wheel encoders, and to perform navi- 
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gation through path integration (also often referred to as dead reckoning). The 
other option is to use exteroceptive sensors, and to navigate using landmarks. 

Dead reckoning. Dead reckoning allows the navigator to determine its current 
position within the map, provided the following information is available: 

— Map, 

— known starting location, 

— current heading, and 

— either current speed or distance travelled. 

Starting from a known position (xq, yo), the navigator measures the distance 
travelled and the current heading. Using trigonometric relationships, the respec- 
tive movement in x and y direction can be determined, and an updated estimate 
of the navigator’s current position computed. 

The major problem with navigation by path integration is that distance 
travelled and, even more importantly, heading cannot be measured perfectly, but 
are subject to an unknown error. This leads to the accumulation of a drift error, 
which can only be corrected by referring to external information, for example 
landmarks. 

Landmark based navigation. An alternative method is to navigate with 
reference to perceptual landmarks that are detectable by exteroceptive sensors. 
Provided landmark recognition is unambiguous, this method does not suffer from 
the accumulation of incorrigible drift errors. 

However, if the recognition of landmarks is not unambiguous (“perceptual 
aliasing”), navigation can become unreliable, and additional methods have to 
be sought to establish the navigator’s position unambiguously. This is discussed 
later in section [42] 

2.4 Why Self- Organisation Is a Good Idea 

I argue that there are two reasons why mechanisms of self-organisation (i.e. 
learning) should be used in preference to pre-installed, fixed mechanisms in au- 
tonomous mobile robot navigation. The first one is a methodological consider- 
ation, the second reason has to do with the nature of the robot’s perception of 
its environment. 

1. Navigation is a core competence of all moving animals, and therefore plays 
a very important role in our lives. We perform navigation tasks so routinely that 
we are usually not even aware of the fact that we are navigating. Only when 
things get difficult — for example when we are lost or moving through unknown 
terrain — we speak of “navigating”. But we are, in fact, navigating all the 
time when moving, making decisions as to which way to turn (path planning), 
updating our representation of space (map building), etc. 

Because of this dominant, yet concealed role of navigation, it is virtually im- 
possible to avoid making assumptions regarding robot navigation that are based 
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on oitr experience. And yet, there are such enormous perceptual and behavioural 
differences between humans and robots that it is likely that these assumptions 
are ill founded. A typical example of this phenomenon is the choice of landmarks 
in an office environment. Almost invariably, humans rank doors very highly as 
landmarks suitable for navigation. It turns out, however, that for robots us- 
ing sonar sensors — one of the most widely used sensor modalities in mobile 
robotics — doors themselves are not that easy to detect. Door frames, on the 
other hand, show up very clearly on a sonar scan. 

The first argument, therefore, for using self-organisation in mobile robot nav- 
igation is that it can reduce the amount of predefinition put in by a human 
operator. 

2. Robot sensors are subject to noise. No two measurements of the same 
object are identical. Furthermore, the sensory perception of a robot can be plain 
wrong and therefore misleading (e.g. through specular reflection of a sonar sig- 
nal), contradictory (e.g. in information regarding the same object, coming from 
different sensors), or just useless (e.g. information from sensors that do not sup- 
ply any meaningful information with respect to the task being performed by the 
robot). 

To make sense of noisy, inconsistent, contradictory or useless data is a hard 
problem, in particular if the data is to be used in a clearly defined, crisp model. 

Learning mechanisms of self-organisation make use of the data that is actu- 
ally available, without prior assumptions. This takes care of useless data. Self- 
organising mechanisms such as artificial neural networks can cluster information 
topologically, which addresses the problem of noise. And finally, if there is infor- 
mation in the sensor signal that allows to detect inconsistency or contradiction 
(e.g. if one sensor produces a freak signal, but all neighbouring sensors provide 
the correct reading), mechanisms of self-organisation have proven to be a good 
method of eliciting that information. 

The second argument in favour of self-organisation, therefore, is that such 
mechanisms are well suited to process the kind of data obtained from robot 
sensors. 

3 Clustering Mechanisms 

3.1 Self- Organising Feature Map 

The self-organising feature map (SOFM), or Kohonen network, is a mechanism 
that performs an unsupervised mapping of a high dimensional input space onto 
a (typically) two-dimensional output space ( [Kohonen, 1988| ). 

The SOFM normally consists of a two-dimensional grid of units, as shown in 
figure 121 

All units receive the same input vector ^. Initially, the weight vectors Wj are 
initialised randomly and normalised to unit length. 

The output Oj of each unit j of the net is determined by equation [TJ 

Oj = Wj ■ z. 



( 1 ) 
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Fig. 2. Self-organising feature map. 



Because of the random initialisation of the weight vectors, the outputs of 
all units will differ from one another, and one unit will respond most strongly 
to a particular input vector. This “winning unit” and its surrounding units will 
then be trained so that they respond even more strongly to that particular input 
vector, by applying the update rule of equation After having been updated, 
weight vectors Wj are normalised again. 

Wj{t+l) = Wj{t)+r]{z-Wj{t)), ( 2 ) 

where 77 is the learning rate (usually a value around 77 = 0.3). The neigh- 
bourhood around the winning unit, within which units get updated, is usually 
chosen to be large in the early stages of the training process, and to become 
smaller as training progresses. Figure |2] shows an example neighbourhood of one 
unit around the winning unit (drawn in black). The figure also shows that the 
network is usually chosen to be torus-shaped, to avoid border effects at the edges 
of the network. 

As training progresses, certain areas of the SOFM become more and more 
responsive to certain input stimuli, thus clustering the input space onto a two- 
dimensional output space. This clustering happens in a topological manner, map- 
ping similar inputs onto neighbouring regions of the net. 



3.2 Reduced Coulomb Energy Networks 



The Reduced Coulomb Energy (RCE) Classifier ( [Reilly et ah, 198^ ) is an- 
other method of classification. Examples of its application to mobile robotics 
Kurz, 1994|Kurz, 6| and [Nehmzow and Owen, 20(50] 



are 



In the RCE net, each class is represented by a representation vector (R- 
vector). Training the RCE-Classifier involves determining the R- vectors. 
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When a pattern is presented to the classifier, the input is compared to each 
of the already existing R-vectors, using some form of similarity measure (e.g. dot 
product) in order to determine the R- vector of highest similarity with the cur- 
rently perceived pattern. If the similarity between the input pattern and the 
“winning” R-vector is within a pre-determined threshold then the input pat- 
tern belongs to the class of this winning R-vector. If the similarity is outside 
the threshold then the input pattern becomes a new R-vector. Thus the bound- 
aries of classes are determined by the nearest neighbour law. Figure shows an 
example for a two-dimensional input vector. 

Because of this learning procedure, the RCE network grows in size as more 
and more input stimuli are presented. 




Fig. 3. RCE-Classifier, two-dimensional example. Each “dot” in the diagram represents 
an R-vector. The circles surrounding each R-vector denote the “threshold area” within 
which an input pattern must fall in order to belong to the corresponding R-vector’s 
class. In the case of patterns falling within more than one threshold area, the nearest 
neighbour law applies. 



3.3 Adaptive Resonance Theory 

A third example of a self-organising artificial neural network is the Adaptive 
Resonance Theory (ART) network. The basic ART architecture consists of two 
fully connected layers of units, a feature layer (FI) which receives the sensory 
input, and a category layer (F2) where the units correspond to perceptual clus- 
ters or prototypes ( [Grossberg, 1988| , see figure E). There are two sets of weights 
between the layers, corresponding to feedforward and feedback connections. A 
“winner-takes-all” criterion is used during the feedforward phase, and a sim- 
ilarity criterion is used to accept or reject the resulting categorisation in the 
feedback phase. 
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Fig. 4. Basic ART Architecture. The units in the feature layer (FI) are completely 
connected to each of the units in the category layer (F2) , through two separate sets of 
weights. For clarity, only a few of the feedforward and feedback connections are shown. 



When an input pattern is presented to the network, it is compared to each of 
the existing prototypes (through the feedforward weights) to determine a win- 
ning node. If the similarity between the input pattern and the winning node 
(through the feedback weights) exceeds a vigilance threshold, then adaptive 
learning is activated and the stored pattern is modified to be more similar to the 
input pattern (the learning method depends on the particular variant of ART 
network used). Otherwise a “reset” occurs, where the current winner is disabled, 
and the network searches for another node to match the input pattern. If none of 
the stored prototypes is similar enough to the given input, then a new prototype 
will be created corresponding to the input pattern. The ART network, therefore, 
grows like the RCE network as new data is presented to the net. 

4 Applications to Mobile Robotics 

4.1 Route Learning 

Mechanism. Self-organising feature maps can be used to achieve route learning 
in a mobile robot ( [Nehmzow and Owen, 2000] ). In experiments conducted in 
Manchester, route following was accomplished by learning perception-reaction 
pairs that resulted in the robot staying on the trained route. Provided perceptual 
aliasing (the confusion of locations because of their perceptual similarity) is 
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either not present at all or low, the association of perception with action will 
successfully train a robot to follow a user-defined route. At Manchester, routes 
of over 100 m length are now routinely trained and learned. One such route is 
shown in figure 




Fig. 5. A route of about 100 m learned by Forty Two. 



To achieve noise resilience on the one hand, and sensor-motor association on 
the other, a self-organising feature map was used as an associator (see figures |2] 

and El). 

During a user-supervised training phase, sensor signals (11 sonar readings 
and 11 infrared readings) as well as user steering commands (a 2-bit encoding of 
motor commands) form the input to a SOFM. This means that due to the learn- 
ing mechanism of the SOFM, perception-action pairs are stored in the weight 
vector of each unit of the network. 

During autonomous traversal of the trained route, the motor action com- 
ponent of the SOFM is set to zero, and the winning unit of the network is 
determined using sensory perception alone. The winning unit’s weight vector is 
then used to determine the motor action required to stay on the trained route. 
This mechanism is shown in figure El 
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Fig. 6. SOFM as associator of perception-action pairs for route learning. 



Results. FortyTwo was successfully trained to learn routes within laboratory 
environments and in the corridors of the Department of Computer Science 
( Nehmzow and Owen, 2000|Nehmzow, 1999] ). Small deviations from the trained 
route due to perceptual aliasing do occur, but our observation was that the robot 
always encountered a unique sensory perception soon enough to regain the cor- 
rect route. In our experiments, 11% of the network’s cells fired in more than on 
location, i.e. were subject to perceptual aliasing. But, as said above, this did not 
prevent the robot from learning routes successfully. 



Extension to point to point navigation. That such a route learning sys- 
tem can be extended to allow navigation between arbitrary locations within a 
mapped area was subsequently demonstrated in experiments discussed in detail 
in Nehmzow and Owen, 2000] and [Owen, 2000| . 



In those experiments, a topological map based on a RCE network was ex- 
tended to include direction and distance information between landmarks — the 
so-called vector map. In conjunction with additional enabling competences (wall 
following, centring at a location, and returning to a previous location on the 
route), FortyTwo was able to determine arbitrary routes within the Computer 
Science Department at Manchester, and to follow them successfully. 



4.2 Self-Localisation 



Any map is meaningless if the navigator’s position within the map is unknown. 
The first step towards map-based navigation, therefore, is self-localisation. 

Experiments carried out in Manchester ( [Duckett and Nehmzow, 1996 and 
[Duckett and Nehmzow, 1998 ) address the problem of localisation in autono- 
mous mobile robots in environments that show a high degree of perceptual 
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aliasing. In particular, the work was concerned with the more general prob- 
lem of re-localisation, where the robot starts from an unknown location within 
the mapped area, and aims to establish its position using perceptual cues as well 
as information about its egomotion. 



Underlying mechanism. The fundamental mechanism used for self-localisa- 
tion is best explained using an example from ^^c^^^an^^e^mzo^^W9^This 
work was later extended, and is described in Duckett and Nehmzo w, 1998 and 
Puckett, 2000 



The architecture of the localisation system is shown in figure |7] 




Fig. 7. Localisation System Architecture. The solid-edged boxes denote behavioural 
modules, and arrows the dependencies between different parts of the system. The 
shaded box denotes the representation used for the map, and the dashed boxes show 
hardware components. 



During an exploration phase, the robot builds a map of its environment. 
This is done by using an artificial neural network such as ART. Other classi- 
fication mechanisms that categorise sensory perception can also be used (see, 
for instance, [Duckett and Nehmzow, 1998 ) . During the entire operation of the 
robot, i.e. during the map building phase and during the localisation phase, the 
orientation of the robot’s turret is kept constant, so that sensory perceptions are 
only dependent upon the robot’s location, not its heading. 
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The robot is then moved to a randomly chosen position within that environ- 
ment, where it will attempt to localise. By actively exploring, and accumulating 
evidence through the use of relative odometry between local landmarks, the 
robot is able to determine its location with respect to perceptual landmarks 
very quickly. 



Localisation algorithm. Possible locations within the map are stored as a 
list of hypotheses in working memory, each with an associated confidence level. 
Whenever the robot perceives a new location, either by recognising a change in 
(clustered) sensory perception or a change in odometry bigger than 2D (where 
Z? is a preset distance threshold) another set of possible locations is created. A 
matching procedure is then used to combine both sources of evidence to produce 
an updated list of hypotheses. The algorithm is given as follows, and explained 
in more detail below. 

0. Initialisation. Formulate set of hypotheses, H = {ho, consisting of 

location points which match the current ART category. Initialise confidence 
levels: V/ii G H, assign conf{hi) = 1. 

1. Wait until ART category changes or robot has moved distance 2D, where D 
is the distance threshold used during map building. 

2. For each hypothesis hi add change in odometry (Aa;, Ay) to the coordinates 
of hi, (xhi,yhi)- 

3. Generate second set of candidate hypotheses, H' = {h'o, ■■■,h'j.^}. 

4. For each hi, find the nearest h'p recording distance apart, dhi- 

5. Accumulate evidence, given distance threshold T, minimum confidence level 
MIN, gain factor GAIN > 1 and decay factor 0 < DECAY < 1: 

V/ii G H 

if dhi < ^ then 

replace {xhi,yhi) with {xh'.iVh'.) from the matching ft,' 
let conf{hi) = conf{hi) x GAIN 

else 

let conf{hi) = conf{hi) x DECAY 
if con f {hi) < MIN then delete hi. 

6. Delete any duplicates in H , preserving the one with the highest confidence 
level. 

7. Add all remaining ft' from H' which are not already contained in H to H, 
assigning conf{h'j) = 1. 

8. Repeat from step 1. 

In step 2, the existing set of location estimates are updated by adding on the 
change in position recorded by relative odometry. A new set of candidate hy- 
potheses is then assembled by selecting all possible location points from the map 
which match the current ART category. A search procedure follows, where each 
of the existing hypotheses is matched to its nearest neighbour in the new candi- 
date set. Step 5 then uses a threshold criteria to determine whether each of the 
matched hypotheses should be increased or decreased in confidence. This means 
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that if the matched position estimates are sufficiently close to each other, then 
this is judged to be a piece of evidence in favour of this particular hypothesis. 
The confidence level is therefore raised by a gain term, and the position esti- 
mate from the old hypothesis is replaced by the new value. Thus, good position 
estimates are continually corrected on the basis of perception. 

Conversely, if the match distance exceeds the threshold, then the associated 
confidence level is scaled down by a decay term, and the position estimate is left 
uncorrected, i.e, it is not replaced by its nearest neighbour in the candidate set. 
Hypotheses which fall below a certain confidence level are rejected and pruned 
from the list of possible locations, eliminating the bad hypotheses quickly and 
minimising the search space. Duplicates created by the matching process are 
deleted here too. 

Any unmatched location points remaining in the candidate set are also added 
to the current list of hypotheses, and are assigned an initial confidence value. 
This will be the case for all of the candidate hypotheses on the first iteration, 
as the hypothesis set will initially be empty. (Initialisation has been included 
as step 0 here for clarity, although this is actually unnecessary.) All possible 
locations with the current perceptual signature are always considered, so that 
the algorithm can cope with unexpected, arbitrary changes of position. 

Eventually, one of the hypotheses emerges as a clear winner. For example, 
in experiments conducted on FortyTwo in our laboratory, this took an average 
of 7 iterations through the algorithm, in a mean time of 27 seconds with the 
robot travelling at 0.10 ms“^. If the robot becomes lost again, confidence in this 
particular hypothesis gradually decays as a new winner emerges. 

Figure [8] shows a map of our laboratory environment. The underlying ref- 
erence frame of this map is a Cartesian space. The clusters indicate the size 
of areas within which the robot’s ART map perceives the same landmark, and 
numbers within the clusters indicate the ART category being perceived at that 
location. It is obvious from that map that perceptual aliasing occurs, because 
there are several locations that all elicit the same response from the network. 



Results. The system was tested on FortyTwo in the robotics laboratory at 
Manchester University. A wall following behaviour was used for exploration, 
and an ART2 network configured to receive input from the robot’s 16 infra-red 
sensors. Map building was carried out during a single circuit of an enclosure 
made from walls and boxes. This was purposely designed to contain areas of 
perceptual aliasing, as shown in the map (figure [H|). The average forward speed 
of the robot travelling around this enclosure was 0.10 ms“^. 

The algorithm was run for 20 iterations per trial for 30 trials. The results 
given in figure El show a steady downward trend in the mean localisation er- 
ror over time. The “actual” position of the robot had to be taken from global 
odometry, because no external means of recording FortyTwo ’s real position is 
currently available. The results shown are therefore only as accurate as the drift 
in odometry will allow. 
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Fig. 8. Map of a laboratory environment in Manchester. Numbers indicate the percep- 
tual category identified by the ART network, locations of these perceptual clusters are 
given in Cartesian coordinates. 
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In addition, the time and number of iterations through the algorithm taken 
for the localisation error to fall below the distance threshold D was measured 
in each trial. The mean number of steps was 7.0, with a standard deviation of 
5.6. The mean time taken was 27.3 seconds, with a standard deviation of 21.9 
seconds. (The average curve shown in figure |9] takes longer than this to reach the 
D level because this calculation takes into account the worst case localisation 
errors.) 

Inspection of individual localisation attempts showed that the algorithm was 
able to localise successfully once the robot had travelled through a unique se- 
quence of perceptions along the perimeter of the enclosure. For example, follow- 
ing the route taken by the robot clockwise in figure [HI the sequence “0, 1, 5” 
of ART2 categories uniquely identifies the point at (-1.28, 2.81). However, the 
sequence “0, 1” would be ambiguous, because there are two possible places in 
the route where this occurs. 



Quantitative localisation performance assessment. While it was possible 
to make qualitative ]ndgeTmnts about the behaviour of the localisation system on 
the real robot, the use of global odometry — which is subject to incorrigible drift 
errors — for calculating the localisation error makes it very difficult to make a 
really accurate assessment on the performance of the localisation algorithm. We 
have therefore introduced quantitative localisation performance measures which 
are independent from robot odometry. These are discussed in the next section. 

5 Assessing Localisation Performance 
Using Contingency Table Analysis 

5.1 Contingency Table Analysis 

In order to assess the performance of the localisation system, we collected loca- 
tion-response pairs of the localisation system in a contingency table like the one 
shown in figure [TOl 

Such a contingency table states how many times a particular localisation sys- 
tem response was obtained at the various physical locations the robot occupied 
during its operatioij^- Contingency table analysis can then be used to determine 
whether there is a statistically significant relationship between localiser response 
and position of the robot ( [Press et al., 1992|Nehmzow, 199^ ). 

Particularly useful in this respect is the uncertainty coefficient U, defined by 
equation O 



U{L I R) = 



H{L) - H{L I R) 
H{L) 



(3) 



with the respective entropies defined as H{L \ R) = H{L, R) — H{R), 
H{L,R) = -X^r./PwlnprZ and H{L) = -Y,ip.ilnp.i. 

^ The physical environment was compartmentalised into distinct zones for the purpose 
of indicating the robot’s current position in the world. 
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Fig. 10. Example Contingency Table. The rows correspond to the response produced 
by the localisation system, and the columns to the “true” location of the robot as 
measured by an observer. This table represents 100 data points, and also shows the 
totals for each row and column. 



The probabilities are defined by equations S] and [3 

N.i 



P-i = 



N 



Prl = 



N ’ 



( 4 ) 



( 5 ) 



and the column totals N.i and the table total N are given by equations 0 
and [ 7 ] respectively. 



II 


(6) 


N = Y,Nrl- 


( 7 ) 



r,Z 



The uncertainty coefficient U takes a value between 0 (“no correlation be- 
tween location and localiser response”) and 1 (“perfect localisation”). 



5.2 Application to Robot Self-Localisation 

We used the uncertainty coefficient U, as introduced in the previous section, to 
compare the localisation performance of three different localisation mechanisms, 
in six different real world environments. 

The three mechanisms used were 

1. Dead reckoning, using the robot’s on-board odometry system, 

2. landmark-based localisation, using sonar and infrared sensor information 
clustered by a nearest neighbour classifier, and 
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3. evidence based localisation, using sonar and infrared sensor information, as 
well as local odometry information and the change of perception as known 
movements are executed (for details see | Duckett and Nehmzow, 1998 ). 



The six environments in which these three localisers were compared are shown 
in table [T1 



Table 1. Characterisation of the six different environments in which entropy-based 
performance measures were applied to the robot’s evidence-based localisation (EBL) 
system. 



Description 


Route Location No. of Places in 
in m Bins Trials EBL Map 


A Drinks-machine area 


60 


24 


298 


88 


B T-shaped hallway 


54 


14 


263 


71 


C L-shaped corridor 


146 


40 


474 


185 


D Small empty room 


23 


8 


232 


33 


E Single corridor 


51 


14 


248 


61 


F E plus moving people 


51 


14 


249 


61 



Table i shows the localisation results obtained using localisation by dead 
reckoning, localisation using a landmark classifier based on a self-organising fea- 
ture map, and localisation using the evidence-based localiser discussed in sec- 
tion [T2] 



Table 2. Summary of Results for evidence-based localisation, localisation by percep- 
tual landmarks, and localisation by uncorrected dead reckoning (cf. figure |TTJ. U{L\ R) 
is the mean uncertainty coefficient. 





EBL 


Landmark 


Dead 




Dist./m for 


Classifier 


Reckoning 




U{L 1 R) = 0.9 


U{L 1 R) 


U{L 1 R) 


A 


0.0 


0.925 


0.663 


B 


3.8 


0.814 


0.724 


c 


13.6 


0.739 


0.785 


D 


0.5 


0.878 


0.786 


E 


3.0 


0.776 


0.864 


F 


4.7 


0.690 


0.686 



Table 12] and figure [TT] show how evidence-based localisation consistently pro- 
vides the best localisation results, and how in dead reckoning localisation per- 
formance decreases as the travel distance increases. Localisation purely based on 
landmarks shows consistent performance versus distance travelled, as it uses only 
perceptual information and does not take temporal information into account. 
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Fig. 11. Localisation results in the six different environments. In all environments local- 
isation by dead reckoning deteriorates with increasing travel distance, while localisation 
using evidence (EBL) improves. Localisation based purely on landmark recognition is 
not affected by distance travelled (U(L \ R) = 1 for perfect localisation). 



6 Open Questions and Ongoing Research 

Having developed mobile robot navigation mechanisms that allow mobile robots 
to learn routes, to localise, and to plan paths within middle scale environments 
of several hundred m^ size, we are currently looking at mechanisms that make 
these navigation systems more efficient. 
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6.1 Learning Parameters and Network Structure 

For robot applications in potentially unbounded environments, for example for 
applications of continuous operation (“lifelong learning”), networks with a fixed 
capacity (such as the self-organising feature map) will eventually saturate. At 
that point, it is impossible to store more landmarks without erasing some infor- 
mation from the net. 

We have therefore used growing networks such as ART and RCE, and are cur- 
rently investigating modifications to the self-organising feature map that would 
also address the problem of saturation. This work, however, is still in its early 
stages. 

6.2 Landmark Selection 

One possible method increasing the efficiency of a map-based navigation system 
is not to store every perceptual landmark encountered, but to make a selection. 
One possible method of making such a selection is novelty detection. 

Novelty detection. The primary purpose of our research in novelty detection 
for mobile robotics is to detect deviations from a norm by means of acquiring 
a “standard model” of the robot’s environment, and to flag deviations from 
that model that exceed a predefined threshold. Possible applications include 
monitoring tasks, whose objective it is to detect faults and critical deviations 
from the norm in the robot’s environment. 

Such a novelty detection system can be used for landmark selection, too, by 
eliminating common perceptual landmarks, i.e. those landmarks which say little 
about the robot’s current position. 

[Marsland et ah, 2000| describe the application of a novelty detector to iden- 
tifying novel stimuli impinging on a robot’s sensor. To achieve this, sensory 
perceptions (one of four perceptually unique light stimuli) were first clustered 
using either a standard self-organising feature map (see figure , or a modified 
self-organising feature map with leaky integrators, i.e. a network whose neuron 
activity decays over time. 

The output of the winning unit of the network was then fed into a habituating 
output neuron, whose synaptic efficacy y{t) decreased according to equation [51 

= a[yo-y{t)]- S{t), (8) 

The effect of this equation is shown in figure IT21 

The return to high synaptic efficacy at time 150 in figure [El is due to a dis- 
habituation behaviour, which can be activated or de-activated according to the 
application’s needs. 

Initial experiments show that habituating, self-organising maps such as these 
can be used for landmark selection as well. Only landmarks that present “rare” 
stimuli pass the filter, reducing the number of stored landmarks per map area 
considerably. Again, the application of landmark selection through novelty fil- 
tering is subject to ongoing research. 
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Fig. 12. An example of how the synaptic efficacy drops when habituation occurs 
(equation!^. Both stimulus and synaptic efficacy are shown: for 0 < t < 150 the 
stimulus is constantly “1”, then drops to “0”. 



7 Summary and Conclusion 

7.1 Self Organisation for Map Building, Route Learning 
and Self-Localisation 

The ability to navigate is one of the most important competences for any moving 
agent. Without purposive motion, the advantages of mobility cannot be fully 
exploited. 

One of the core components of a navigational skill is the ability to construct 
internal representations (“maps”) of the world. A map can then be used for 
navigation if the agent has the additional competences of self localisation within 
the map, map interpretation, and path planning. 

This paper presents two examples of how map building in a mobile robot can 
be achieved through self organisation. Using a self organising mechanism has 
the double advantage of minimising the use of user-defined assumptions, and of 
taking the actual sensory perceptions into account when map building, rather 
than abstracted model perceptions. This increases the navigator’s resilience to 
noise. 



Route learning and point to point navigation. In a first robot navigation 
example, a self-organising feature map — an artificial neural network that clus- 
ters the robot’s sensory perceptions topologically — was used in a route learning 
and free navigation system. During a supervised training phase, the network was 
trained, using sensory perceptions as well as steering commands as inputs. In 
this manner, perceptions were associated with motor actions. 

During the subsequent autonomous route following phase, sensory percep- 
tions alone were used to determine which of the network’s artificial neurons 
responded most closely to that particular sensory stimulus. The motor action 
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component of the weight vector of that neuron then identified the required motor 
action. 

Self localisation and localisation performance assessment. A second ex- 
ample showed how an Adaptive Resonance Theory self-organising network can 
be used to establish the robot’s position within the map, having started from an 
unknown location. This is the “lost robot problem” . 

Integrating both perceptual and temporal information, the robot constructs 
a list of map locations that have the same perceptual signature as the robot’s 
current location in the world. Taking into account the robot’s egomotion as 
determined through dead reckoning, the list of hypotheses is continuously short- 
ened, until only one candidate location remains. At this point, self-localisation 
is achieved. 

In our experiments in Manchester, our mobile robot FortyTwo was always 
able to self-localise. The travel distance this would take was dependent on the 
environment the robot operated in, but was typically 2 to 5 m of travel, and 
hardly ever exceeded 10 m. 

To assess the performance of our localisation system and to compare it with 
other localisation methods, we used contingency table analysis, which allowed us 
to grade the localisation performance quantitatively, without using drift-affected 
odometry information at all. 

7.2 Problems of Map Building through Self-Organisation 

We identified two major problems with the approach to map building and nav- 
igation through topological clustering of sensory information, and proposed so- 
lutions to these problems, too. 

Perceptual aliasing. Constructing maps from topologically clustered sen- 
sory perceptions will result in a certain degree of perceptual aliasing (in our 
route learning experiments, a little over 10% of all network units fired in more 
than one location). 

Increasing the sensor resolution of the robot might reduce this number, but 
may not lead to better navigation systems, because very high sensor resolution 
would reduce the recognition rate of landmarks on subsequent visits. It would 
also increase the information processing burden as additional, often irrelevant 
information would have to be processed. 

The way we addressed the problem of perceptual aliasing is discussed in 
section 14.21 Retaining the relatively low sensor resolution of the robot, we dif- 
ferentiated between two locations that looked alike by taking into account the 
path (i.e. the temporal pattern of perceptions) by which the robot arrived at the 
current location. 

Saturation. Unless the self-organising map building mechanism is to be 
used in a closed environment of limited size, the problem of network saturation 
becomes relevant. 

If a network of finite, constant size such as the self-organising feature map 
is used, there will be a point at which further landmark information can only 
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be stored in the net if previously stored information is removed. This is usually 
undesirable. 

To address this problem, we used growing neural networks such as the Re- 
stricted Coulomb Energy network f section 13.21) or Adaptive Resonance Theory 
(section 13.31 1 . Both networks add new neurons when novel information is en- 
countered. One problem in this respect is to find the right balance between 
generalisation and specificity, i.e. to determine the parameter that controls the 
adding of new neurons. We are not aware of a principled approach to find the 
right parameter, and have used trial and error in our experiments. 

7.3 Focusing the Stored Information through Landmark Selection 

One approach to increasing the information content of the map that we consider 
promising is that of landmark selection. The idea is that instead of storing all 
sensory perceptions that the robot encounters during the mapping phase, only 
those perceptual landmarks are stored that differ significantly from commonly 
encountered landmarks. 

Initial experimental results of using a habituating self-organising map to act 
as a landmark selector, a novelty filter, have proven promising, and we believe 
that the addition of a landmark selection filter prior the map builder (section [62]) 
will improve the performance and efficiency of landmark based navigation sys- 
tems. 
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Abstract. This work is concerned with practical issues surrounding the 
application of reinforcement learning to a mobile robot. The robot’s task 
is to navigate in a controlled environment and to collect objects using 
its gripper. Our aim is to build a control system that enables the robot 
to learn incrementally and to adapt to changes in the environment. The 
former is known as multi-task learning, the latter is usually referred to 
as continual ‘lifelong’ learning. First, we emphasize the connection be- 
tween adaptive state-space quantisation and continual learning. Second, 
we describe a novel method for multi-task learning in reinforcement en- 
vironments. This method is based on constructive neural networks and 
uses instance-based learning and dynamic programming to compute a 
task-dependent agent-internal state space. Third, we describe how the 
learning system is integrated with the control architecture of the robot. 
Finally, we investigate the capabilities of the learning algorithm with re- 
spect to the transfer of information between related reinforcement learn- 
ing tasks, like navigation tasks in different environments. It is hoped that 
this method will lead to a speed-up in reinforcement learning and enable 
an autonomous robot to adapt its behaviour as the environment changes. 



1 Introduction 



1.1 Reinforcement Learning for Robots 

It is desirable that our robots learn their behaviours rather than have them hand- 
coded. This is the case because programming all the details of the behaviours 
by hand is tedious, we may not know the environment ahead of time, and we 
want the robots to adapt their behaviours as the environment changes. 

Reinforcement learning (RL) has been used by a number of researchers 
as a computational tool for building robots that improve themselves with expe- 
rience II, Strictly speaking, reinforcement learning is a problem 

formulation. It defines the interaction between a learning agent and its environ- 
ment in terms of states, actions, and rewards. A reinforcement-learning agent 
improves its performance on sequential tasks using reward and punishment re- 
ceived from the environment. There is no teacher that tells the agent the correct 
response to a situation when the agent performs poorly. An agent’s only feed- 
back, the only indication of its performance, is a scalar reward value. The task of 
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the learning algorithm is to find a policy, mapping environment states to actions, 
that maximises the reward over time. 

Reinforcement learning methods try to find an optimal policy by comput- 
ing a cumulative performance measure from immediate rewards using statistical 
methods and dynamic programming. There are a number of reinforcement learn- 
ing techniques that work effectively on a variety of small problems ^\MM- 
But only few of them scale up well to larger problems. As generally known, it 
is difficult to solve arbitrary problems in the general case. Kaelbling et al. [l^ 
suggested therefore to give up tabula rasa learning techniques and to guide the 
learning process by shaping, local reinforcement signals, imitation, problem de- 
composition, and reflexes. These techniques incorporate a search bias into the 
learning process which may lead to speed-ups. 

Apart from the scalability problem, which is independent of the applica- 
tion domain, additional difficulties arise if reinforcement learning methods are 
applied in robotics. Such problems are the existence of hidden state, the slow 
convergence in stochastic environments with large state spaces, and the need for 
generalisation. 

In general, a robot will have limitations in observing its environment. Some 
uncertainty will remain in the agent’s belief about the current state of the world. 
The usual approach is to use memory of previous actions and observations to 
aid the disambiguation of the states in the world. In this work, we assume that 
a method for state disambiguation is available. Although there may be a great 
deal of uncertainty about the effects of an agent’s actions, there is never any 
uncertainty about the agent’s current state. 

1.2 Speeding Up the Learning Process by Incremental Learning 

Ideally, we would like to build mobile robots that learn incrementally and that 
adapt to changes in their environment during their entire lifetime. Hopefully, this 
will also speed up the learning process. In most RL work to date, the algorithms 
do not use previously learnt knowledge to speed up the learning of a new task. 
Instead, the learning agent either ignores the existing policy, or worse, the current 
policy harms the agent while learning the next one. Therefore, we would like to 
address the issue of learning multiple tasks in reinforcement environments. 

The main challenge in building agents capable of incremental learning is to 
find effective mechanisms for the transfer of information between learning tasks. 
Recently, this need has been recognised in the machine learnin g co mmunity by 
initiating a new research direction, termed ‘learning to learn’ [Sl|. Transfer in 
supervised learning involves reusing the features developed for one classification 
or prediction task as a bias for learning related tasks Transfer in 

reinforcement learning involves reusing the information gained while learning to 
achieve one goal for learning to achieve other goals more easily At the 

time of writing, only very few of these approaches have been applied to real 
robots [^. 

Ring [2!tI ] defined continual learning as incremental, hierarchical development. 
He suggested that an agent should make use of previously learnt information 
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when it has to learn a new behaviour. Moreover, the behaviours should be learnt 
in a bottom-up process whereby the old behaviours are used as constituents of 
newly created behaviours. 

In this work, we embrace these ideas and focus on two important problems of 
continual learning: the adaptability to changes in the environment and multi-task 
learning. First, a continual-learning agent should be able to cope with an envi- 
ronment that changes unpredictably with time. It should adjust its behaviour 
to both gradual and sudden change 20 1. Second, the agent should learn incre- 
mentally. That is, in learning multiple tasks, it should improve its performance 
at each task with experience and with the number of tasks [3^] . For an algo- 
rithm to fit this definition, some kind of transfer must occur between multiple 
tasks. This means that the representations used by the lear ning method must 
support incremental learning. Tree-structured representations [1 81 ] and construc- 
tive neural networks do so and therefore can solve these two problems. Here, we 
use connectionist function approximators. Since incremental learning in neural 
networks using distributed representations is still an open issue, we decided to 
use neural networks that are based on local representations only. 

The aim of this work is to make a contribution towards the development of 
continual learning methods applicable to mobile robots. We propose a learning 
technique that combines incremental learning in constructive neural networks 
with the computational efficiency of instance-based learning and dynamic pro- 
gramming methods. Moreover, we present a robot control architecture that is 
suitable for reinforcement learning. For state estimation, we use a position track- 
ing system which is based on sonar sensor readings. The learning algorithm is 
implemented as a module of the control architecture. In selected experiments, 
we show that the proposed learning algorithm allows the transfer of information 
between related reinforcement learning tasks in a real robot. 

We do not attempt to address in this paper all the issues related to continual 
learning. For example, we do not discuss methods for exploration. Exploration 
is surely needed in changing environments. We present preliminary results. The 
learning algorithm does not include any means of forgetting yet. In the presen- 
tation, we focus on a description of the algorithm, but do not give a detailed 
quantitative comparison of the learning methods. 

The paper is organised as follows. In Section [21 we introduce reinforcement 
learning. We provide a formal framework for learning from reinforcement and 
summarise methods for finding an optimal policy. In Section (HI we critically 
examine previous work on learning with adaptive state spaces, introduce a crite- 
rion for the distinction of states, and discuss the main challenges of incremental 
learning. In Section 01 we introduce a novel constructive neural-network model, 
called dynamic cell structures, which can be used for multi-task reinforcement 
learning. In Section 01 we describe the control architecture of the robot, which 
consists of modules for low-level control, vision-based object detection, position 
tracking using sonar sensors, and learning. The proposed learning algorithm is 
applied to a mobile robot which has to learn a navigation task in different envi- 
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ronments. Preliminary results of this application are given in Section!^ Finally, 
we draw some conclusions in Section 0 

2 Reinforcement Learning 

2.1 Framework 

This section provides a formal notation for describing the environment and the 
agent. It has been adapted from previous work by Kaelbling et al. j^. We ap- 
proximate the world by making the state space, the action space and the time 
discrete. At each time step, an action is executed, and the world instantaneously 
transforms itself from its previous state to the new state that results from that 
action. These assumptions are universally considered valid for real robots. 

The agent’s environment can be defined as tuple {X^A,T), where 

— A is a finite set of states of the world, 

— A is a finite set of actions, and 

— T : X X A ^ X is the state-transition function, giving for each world state 
and agent action a probability distribution over world states. 

The agent’s reward is defined by the reward function, ii : A x A — >■ K, giving 
the expected immediate reward gained by the agent for taking each action in 
each state. A policy tt is a mapping from A to A specifying an action to be taken 
in each situation. 

The tuple (X,A,T,R) defines a Markov decision process (MDP) if future 
states and rewards are conditionally independent from past states and actions. 
The state and reward at time t -I- 1 is dependent only on the state at time t and 
the action at time t. This is known as the Markov property. 

In the description of the agent, we have to take into account that the agent 
may not be able to determine the state it is currently in with complete reliability, 
as it is the case for a robotic agent. We define the agent’s perception of the 
environment as a tuple (17,0), where 

— 17 is a finite set of observations the agent can experience of its world, and 

— 0:AxA— >-77(17) is the observation function, which gives for each action 
and resulting state a probability distribution over possible observations. 

The tuple (A, A, T, 7?, 17, 0) defines a partially observable Markov decision 
process (POMDP). Kaelbling et al. proposed to decompose the problem of 
controlling a POMDP into two parts, as shown in Figured] The agent makes ob- 
servations, o G O, and generates actions, a G A. It keeps an internal belief state, 
b, that summarises its previous experience. A belief state is a discrete probabil- 
ity distribution over the set of world states. A, representing for each state the 
agent’s belief that it is currently occupying that state. The state estimator (SE) 
is responsible for updating the belief state based on the last action, the current 
observation, and the previous belief state. The component tt is the policy. As 
before, it is responsible for generating actions, but this time as a function of the 
agent’s belief state rather than the state of the world. 
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Fig. 1. A POMDP agent can be decomposed into a state estimator SE and a policy vr. 

2.2 State Estimation 

When the agent is provided with a POMDP model of its environment and the 
agent uses the belief state as the state space on which to specify its policy, the 
key difficulty is that this state space is continuous and has as many dimensions as 
there are world states. This makes the task of finding an optimal POMDP con- 
trol strategy computationally intractable. On the other hand, we have efficient 
computational techniques for solving (completely observable) Markov decision 
processes. That is, a POMDP problem can be solved if the agent can be pro- 
vided with a state estimator for resolving the uncertainty about the state. In 
many domains, such state estimation methods are indeed available. For exam- 
ple, Kalman filters [l^ and Bayesian methods jHl 0 have been used for state 
estimation in mobile robot navigation. The belief state has been represented by 
unimodal Gaussian distributions (Kalman filter), sets of Gaussians, probability 
grids, or sampling methods (Markov localisation). 

A position tracking system which is based on sonar-sensor readings and that 
can be used as state-estimator in robot-navigation tasks is described in Section |5] 
So, in the following, we assume the availability of a state estimator. The output 
of the state estimator will be the most likely state, i.e., the world state with the 
highest probability. Moreover, we assume that 5 is a finite set of discrete states 
of the agent, and there is a mapping A — )> 5 so that {S, A, T, R) is a finite state 
Markov decision process. That is, there is no hidden state. The information about 
the agent’s state, s G S, combined with the POMDP model of the environment 
is sufficient to predict, in principle, the future interaction of the agent with the 
environment. 

2.3 Finding a Policy Given a Model 

Given a policy tt and a reward function R, the value of a state s G 5 is the 
expected value of sums of the rewards to be received at each future time step. 
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discounted by how far into the future they occur. That is, 



VAs) = E 






j=o 



where rt is the reward received on step t of executing policy tt after starting in 
state s. A closely related quantity a) is the value of state s when action a is 
executed first and policy tt is followed thereafter. The discount factor, 0 < 7 < 1, 
controls the influence of the rewards in the distant future. When 7 = 0, the value 
of a state is determined entirely by rewards received at the next step. We are 
generally interested in problems with further horizon and so we set 7 to be 
near 1. 

So far, we have considered learning tasks in which the agent-environment 
interaction goes on continually without limit. The above definition of V^(s) can 
be applied to episodic tasks (trials) as well. Episode termination corresponds 
to entering a special absorbing state that transitions only to itself and that 
generates only rewards of zero. 

The function V can be defined recursively [2^ as: 



14(s) = R{s,a) -|-7^T(s,a, s')K-(s'). 

s'GS 



We are interested in finding an optimal policy, i.e., a policy tt that maximises 
VTr{s) for all s G 5. We shall use tt* to refer to an optimal policy for an MDP, 
and express the optimal value and Q functions as E(s) and Q(s, a). The optimal 
value function is unique and can be defined as the solution to the simultaneous 
equations 

V (s) = max I R{s, a) -I- 7 T(s, a, s')V (s') 
aeA \ 

for all s G S. Given the optimal value function, the optimal policy can be 
specified as: 



7t(s) = arg max i?(s, a) + 77^ ^(s, a, s')E(s') 

“6A \ 7— t 

\ s'gS 

There are many methods for finding optimal policies for MDPs. One way is 
to find the optimal value function. It can be determined by a simple iterative 
algorithm called value iteration (^. The algorithm is shown in Figure E] It 
stores approximations to P(s) and Q{s,a). The algorithm terminates when the 
maximum difference between two successive value functions is less than some e. 




2.4 Learning an Optimal Policy 

In the previous section, we have assumed that we had already a model. The 
model consists of knowledge of the state transition-probability function T and 
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Vi(s) •«— 0 for all s £ 5 
1 1 - 1 

loop 

loop for all s £ 5 

loop for all a £ ^ 

Qt{s, a) t- R{s, a) + 7 I] T(s, a, s')Et-i(s') 

s' 

end loop 

Vtis) ^ max Qtis, a) 

aeA 

end loop 

until |Tt(s) — Vt-i(s)| < e for all s £ 5 



Fig. 2. Finding the optimal value function through value iteration. 



the reinforcement function R. We are interested in applications for which such 
a model is not known in advance. The learning agent must interact with its 
environment directly to obtain information which can be processed to produce 
an optimal policy. 

There are two classes of reinforcement learning algorithms: model-free algo- 
rithms which learn a policy without learning a model, e.g., Q-learning and 
the TD(A) algorithm | l28j . and model-based algorithms which learn such a model 
and use it to find an optimal policy, e.g., prioritised sweeping 0 and real-time 
dynamic programming [^. For the general case, it is not clear yet which approach 
is better in which circumstances. 

3 Learning with Adaptive State Spaces 

In the ideal case, the agent’s state-space is the set of all possible partial solutions 
to the given learning task. Actions are ways of getting from one state to another. 
A successful solution consists of a sequence of actions in that state space. In most 
RL applications, the agent designer predefines the possible perceptual inputs, the 
internal state-space, and the actions of the learning agent. Pre-selecting the state 
space by a human programmer may be very effective because it can provide a 
useful bias for the learning algorithm. However, this approach will fail if applied 
to multi-task learning. Generally, the learning agent does not know beforehand 
about the perceptual input it will encounter. So, the agent should be able to 
select its state space automatically while learning. 

3.1 Adaptive Resolution Methods 

The automatic extraction of relevant states is particularly important for learn- 
ing in multidimensional continuous state spaces. What we would like to do is 
partition the environment into regions of states that can be considered the same 



30 



A. Grofimann and R. Poll 



for the purpose of learning and generating actions. If this discretisation is too 
coarse-grained, the agent will fail. If the discretisation is too fine-grained the 
number of states is too big more than needed and learning becomes compu- 
tationally intractable. To be able to generalise efficiently over input, we need 
different granularity in different regions of state space. This problem has been 
addressed in methods that use adaptive resolution such as variable resolution 
dynamic programming (VRDP) and parti-game 22 1 . 

The VRDP algorithm uses a fcd-tree, which is similar to a decision tree, to 
partition the state space into coarse regions. The coarse regions are refined into 
detailed regions, but only in parts of the state space which are predicted to be 
important. This notion of importance is obtained by running ‘mental trajectories’ 
through state space, i.e., executing the current optimal actions off-line. It is 
assumed that those states encountered during mental practice are states which 
are particularly important to know about it detail. A disadvantage of VRDP is 
that it partitions the state space into high resolution everywhere the agent visits, 
potentially making many irrelevant distinctions. An advantage of VRDP is that 
the part of the problem which truly requires learning (building the model of the 
environment) is entirely decoupled from the part which requires computation 
(the search for an optimal strategy). Dynamic programming is performed only 
on the extracted states. 

Parti-game is another adaptive resolution method. It addresses the problem 
of learning to achieve goal configurations in deterministic high-dimensional con- 
tinuous spaces by dynamically partitioning the space into hyper-rectangular cells 
of varying sizes, represented using a kd-tree data-structure. Parti-game takes 
advantage of memorised instances in order to speed up learning. The results 
reported by Moore j2^ indicate that instance-based methods are well-suited to 
problems in which the state-space granularity is changing. 



3.2 Memorising Learning Experience 

Instance-based learning methods such as parti-game memorise as much learning 
experience as possible. In the robotics domain, this is certainly profitable. Agents 
should learn in as few trials as possible, and learning experience is usually ex- 
pensive. In robotics, gathering experience can take several orders of magnitude 
more time than the necessary computation. Learning two things at the same 
time - evolving the state space as well as trying to find an optimal policy - 
does not necessarily take more time than learning the policy on its own for a 
given state space. If the agent is able to learn a good representation of the state 
space, it will generalise well, which in turn will speed up the learning process. 
However, this is only possible if the agent makes efficient use of the experience. 
If the learner incorporates experience merely by averaging in its current, flawed 
state space granularity, it is bound to attribute experience to the wrong states. 

In the learning methods above, we can distinguish between representational 
tools and learning paradigms 23(. The representational tools in VRDP and 
parti-game are fcd-trees. They are used to partition the state space, approximate 
the value function, and store learning instances. The learning paradigm defines 
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what the representation is used for, how training data is used to modify the 
representation, whether exploratory actions are performed, and other related 
issues. It also specifies a criterion for splitting (and possibly merging) states. 
Recently, McCallum jl^ defined such a criterion. He suggested to distinguish 
states that have different policy actions or different utilities, and merge states 
that have the same policy action and same utility. This utile distinction test 
yields a task-dependent state representation. 



3.3 Utility-Based State Distinction 



Often a perceptually aliased state that affects utility will have wildly fluctuating 
reward values. However, we cannot base a state splitting test solely on reward 
variance because some changes in reward are caused by the stochastic nature of 
the world. The agent must be able to distinguish between fluctuations in reward 
caused by a stochastic world, and fluctuations that could be better predicted 
after adding a state distinction. As suggested by McCallum 0 , the agent can 
use statistical tests based on future discounted reward to find state distinctions 
that help predict reward. 

A state model, which has been created by state distinctions, is said to be 
Markov with respect to reward if and only if future reward is conditionally inde- 
pendent of past states and actions, given the current state and action . That 
is, P(rt+i|st,at, st_i,at_i, . . . , So) = P(?'t+i|st, a*) where rt+i is the expected 
future discounted reward at time t-l-1. A state is Markov with respect to reward 
if knowledg e of past states does not help predict future reward from that state. 
McCalluml proposed to keep reward statistics associated with incoming transi- 
tions to a state. If the state satisfies this Markov property, its separated reward 
statistics are expected to be the same. If, on the other hand, there is a statis- 
tically significant difference in future discounted reward between the statistics 
depending on where the agent came from, then knowledge of which state the 
agent came from does help predict reward. In this case, the state should be split. 

It is still an open question which non-parametric statistical test should be 
used. McCallum (^] has reported promising results using the Kolmogorov- 
Smirnov test, which asks whether two distributions came from the same source. 



3.4 Incremental Learning 

Generally, changes in the environment occur unpredict ably. They could not in 
principle have been modelled, nor could their occurrence have even been an- 
nounced. Adjusting to this sort of event is an important problem for any learn- 
ing agent [20(]. Being confronted with changes in the environment, the system 
may forget what it has previously learnt. This problem is usually referred to 
as stability-plasticity dilemma - how can a learning agent preserve what it has 
previously learnt, while continuing to incorporate new knowledge. 



32 



A. Grofimann and R. Poll 



3.5 Discussion 



At this point, we want to summarise the lessons learnt from previous work on 
learning with adaptive state spaces. First, tree-structured representations such 
as Aid-trees have been used successfully to find a task-dependent segmentation of 
the agent-internal state space. As representational tools, they support incremen- 
tal development. However, tree-structured representations are rather impractical 
when they are used for continual learning, which does include the need for for- 
getting. Second, for learning with adaptive state spaces in general and for robots 
in particular, it is essential to make efficient use of experiences encountered in 
previous learning trials. To be able to access the past experienc es efficiently , 
they should be stored locally in the state-space partitions. Third, iMcCallumF s 
utile distinction test provides us with a criterion for deciding whether a partic- 
ular state distinction helps in predicting reward. It has been used in learning 
algorithms to find distinctions in a discrete sensory space jisj. 

We propose to use constructive neural networks as representational tools for 
continual learning. Ontogenic neural networks that perform a segmentation of 
the input space by unsupervised learning seem to fulfil the requirements of con- 
tinual learning better than fcd-trees. The main issue in constructiv e technique s 
is to decide when and where to add new neurons. We suggest to use iMcCallumr s 
utile distinction test as selection criterion. It is hoped that the utile distinction 
test will perform equally well for distinctions in continuous space. 



4 Multi-task Learning with Dynamic Cell Structures 



The continual-learning method proposed in this paper performs two kinds of 
learning: instance-based population of the state space with raw experience and 
dynamic programming to calculate expected future reward. Both components 
are well separated. The instance-based part is used to extract the relevant state 
space, then dynamic programmin g is perform ed only on the extracted states. 
This a pproach wa s motivated by iMcCalluml ’s U-Tree algorithm jl^. In fact, 
McCa.lliiml ’s learning paradigm together with a representational tool ~ a 



we use 



constructive neural network - that supports incremental and continual learning. 



4.1 Representation 

By value iteration, we compute payoff predictions over a discrete set of agent- 
internal states. Given that the perceptual space is usually multi-dimensional and 
continuous, there is the need to identify the individual states and to approxi- 
mate their values. For multi-task learning, it should be possible to add input 
patterns in a remote area of the input space incrementally, without affecting the 
already learnt input/output relations. Function approximators based on local 
representations can do so. On the other hand, they perform only local generali- 
sation which often prevents them from discovering global relations in the input 
space, which function approximators based on distributed representations are 
capable of. 
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In this work, we use ontogenic neural networks as function approximators. 
Unsupervised ontogenic neural networks have been Mplied successfully to vec- 
tor quantisation, data visualisation, and clustering [ 3 ]. Fritzke proposed an 
incremental self-organising network model, which is called growing cell struc- 
tures. This model is able to generate dimensionality reducing mappings. These 
mappings are able to preserve neighbourhood relations in the input data and 
have the property to represent regions of high input density on correspondingly 
large parts of the topological structure. In contrast to Kohonen’s self-organising 
feature map which serves similar purposes, neither the number of units nor 
the exact topology has to be predefined in this model. Instead, a growth process 
successively inserts units and connections. In the following, we intro duce th e 
dynamic cell structures network model, which has been adapted from iFritzkel ’s 
growing cell structures [§]. 

The model comprises a set C of nodes. Each node c G C has an associated n- 
dimensional reference vector, Wc, which indicates the centre of its receptive field 
in the input space R". A given set of nodes with their reference vectors defines 
a particular partition of the input space. The receptive field of a cell c consists 
of those points in R" for which Wc is the nearest of all currently existing refer- 
ence vectors. The cells in the network represent the agent-internal states of the 
reinforcement-learning agent. Between certain pairs of nodes there are edges indi- 
cating neighbourhood. The resulting topology is strictly fc-dimensional whereby 
k is some positive integer chosen in advance with k < n. The basic building block 
and also the initial configuration of each network is a fc-dimensional simplex, con- 
sisting of A: -I- 1 fully connected nodes. For k = 2 this is a triangle, for fc = 3 a 
tetrahedron, and for fc > 3 the structure is referred to as hypertetrahedron. See 
Figure [3l 



4.2 Algorithm 

We propose an algorithm, referred to as dynamic cell structures, that performs 
a dynamic segmentation of the input states and that keeps track of past learn- 
ing experiences. The partitions of the input space correspond to agent-internal 





Fig. 3. Fritzke’s cell structures of different dimensionality k. 



34 



A. Grofimann and R. Poli 



states, and the learning experiences are linked to the individual partitions in 
which they have occurred. 

Learning begins by finding a suitable initial segmentation of the input space. 
In the pre-learning phase, the agent collects initial experience by executing ran- 
dom actions or actions provided by a reinforcement program that acts as a 
demonstrator. The learning experience is stored as a chain of transition in- 
stances, Tt = {Tt-i, at-i,Ot,rt), where the encountered observations are then 
used as training instances for finding the initial network by unsupervised learn- 
ing as follows: 

(1) Choose an observation ot from the set of all transition instances and present 
it to the network. 

(2) Determine the best-matching unit Cb (the unit with the nearest reference 
vector): Hwh — ot\\ < ||wi — Ot\\ for all Ci € C. 

(3) Move Cb and its direct topological neighbours (those unit which are con- 
nected to Cb by an edge) towards o* by fractions Cb and e„, respectively, 
of the total distance: Wb = Cb{ot — Wb) for the best matching unit Ub and 
Wi = (n{ot — Wi) for all topological neighbours Ci of Cb- 

The action at-i in Tt was selected on the basis of observation Ot-i, which is part 
of Tt-i- Therefore, the algorithm associates each transition instance Tt with the 
best-matching node for the observation Ot-i of the predecessor Tt-\. 

The learning phase consists of the following steps: 

(1) The agent makes a step in the environment. It records the transition as an 
instance, and puts it to the end of the chain of instances. The algorithm 
stores the instance Tt with the best-matching node for the observation Ot-i 
of the predecessor Tt-\. 

(2) For each step in the world, the agent does one sweep of value iteration, with 
the nodes of the network as states: 

Q{s, a) -(r- R{s, a) + j'^T{s,a,s')V{s') 

s' 

with R(s') = max Q(s^ o)- Both R{s,a) and T{s,a,s') can be calculated 
aeA 

directly from the recorded instances. 

(3) After every I steps, the agent tests whether newly added information or 
dynamic programming (value iteration) has changed the transition utility 
statistics enough to warrant adding any new distinctions to the agent’s inter- 
nal state space. In this case, the agent selects a parent node, creates a fringe 
node in the direct neighbourhood of the parent, and uses the Kolmogorov- 
Smirnov test to compare the distributions of future discounted reward asso- 
ciated with the same action from the fringe node and any other node. If the 
test indicates that all the distributions have a statistically significant differ- 
ence, this implies that promoting the relevant fringe node into a non-fringe 
node will help the agent to predict reward and so the change is retained. 
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The distribution of the expected future discounted reward associated with 
a particular node is composed of the set of expected future discounted reward 
values associated with the individual instances in that node. The expected future 
discounted reward of instance Ti, Q{Ti), is defined as Q{Ti) = rj +7 V {C{Ti+i)). 
The node to which instance T belongs to is termed C{T). 

As parent nodes, we select nodes that have a large deviation between their 
instances’ utility values. There are many possible variations on techniques for 
selecting the parent node. A straightforward one is to sort the nodes in descend- 
ing order according to the deviation between their instances’ Q-values and to 
select the nodes in this order. 

The reference vector of the fringe node is found by unsupervised learning. 
As training instances we use only the observations associated with the selected 
parent node. During the unsupervised learning, we allow only the reference vec- 
tors of the parent and fringe node to change. The reference vectors of all other 
units are not changed in this process. 

The learning algorithm attempts to solve two problems at the same time - 
evolving the state space by adding fringe nodes to the neural network and finding 
an optimal policy by value iteration on the evolved state space. An example is 
given in the Section O and El 

5 A Control- Architecture for the Pioneer 1 Mobile Robot 

Controlling a mobile robot in the real world generally involves complex oper- 
ations for motor control, sensing, and high-level control. In order to deal with 
this complexity, robot designers often use hierarchical, behaviour-based control 
architectures. That is, the complex behaviour of the robot visible in the real 
world results from the interaction and coordination of a set of basic behaviours. 
We have used reinforcement learning to find individual basic behaviours. In the 
following, we will describe the system architecture we have used on a Pioneer 1 
mobile robot which integrates behaviour-based control, sensing, and learning. 

5.1 Control, Perception, and Learning 

The Pioneer 1 is powered by two DC motors. It can move with an approxi- 
mate maximum speed of 60 cm per second. The robot is equipped with seven 
ultrasonic proximity sensors - one on each side and five forward facing - and a 
colour CCD camera, which is mounted in a fixed position on top of the robot’s 
console. The Pioneer used in the experiments is also equipped with a 2 axis, 
1 degree-of-freedom gripper. The robot control software runs off-board, on a re- 
mote UNIX workstation, with which the robot communicates through a radio 
link. The remote software receives status updates from the robot and its sonar 
sensors at a frequency of 10 Hz. The maximum command rate for changing 
the motion direction is only 2 Hz (or less) due to limitations of the on-board 
controller. The video images are also processed off-board. The maximum frame 
rate depends on the image processing operations performed. Object detection 
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Fig. 4. The system architecture of the robot control software. Perception modules are 
on the left, control modules on the right. 

using colour segmentation is done in real-time at a rate of about two frames per 
second. 

FigurelHshows the basic architecture of the robot control software. It consists 
of several modules, which are implemented as different processes communicating 
with each other through a communication server. At the lowest level of the 
system architecture, we find the Saphira server [l^ . which runs on board the 
robot. The function of the server is to control the drive motors and the gripper 
and to collect data from the position encoders of the wheels and the sonar 
sensors. The server receives target values of the wheel velocities from the Saphira 
client running on the host computer. The Saphira client integrates a number of 
functions such as sending commands to the server, gathering information from 
the robot’s sensors, and micro-tasking. 

The robot perceives the environment through the colour camera and the 
sonar sensors. The vision system performs colour segmentation. It can detect 
objects of three distinct colours: red and blue soda cans and the green labels of 
the collection points. The position tracking system computes the robot’s position 
and orientation in the environment from previous distance readings. 

The learning system is the part of the system architecture that is responsible 
for adaptability and learning. Its task is to find action policies through reinforce- 
ment learning. As described in Section E] a POMDP agent can be decomposed 
into a state estimator and a policy. In the architecture given above, the agent’s 
observations are the readings from the sonar sensors and wheel encoders, the 
state estimator is the position tracking system, the agent’s belief is the position 
and orientation in the environment, and the policy is computed by the learning 
system. 
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5.2 Behaviours and Action Selection 

Learning a complex task from scratch is currently impractical for mobile robots. 
However, it is often possible to learn the task in stages (incremental learning) or 
to learn some part of the task. For example, a can-collecting robot could learn the 
navigation from reinforcement while all other task-achieving modules are coded 
by the robot designer. That is, we could solve a complex task by integrating 
autonomous learning and manual programming. 

The output of the learning system are actions which are to be executed by 
the robot’s actuators. In many cases where reinforcement learning is used on a 
real robot, we will need some safety mechanism. For example, while learning to 
navigate an environment, we need to ensure that the robot does not drive with 
full speed against a wall. 

Behaviour-based, hierarchical control mechanisms such as the Saphira archi- 
tecture by Safhoti et al. 1^ fulfil the requirements mentioned above. The idea 
is to decompose the control problem into small units of control. These basic 
behaviours are often relatively simple and can be added incrementally. Strictly 
speaking, a behaviour is a mapping from the internal state to control actions 
of a robot within a restricted context. Moreover, the behaviours have hierarchy 
levels assigned. An action issued by an individual behaviour will be executed if 
the behaviour is active and the action is not over-ruled by a behaviour higher 
up in the hierarchy. Actions issues by behaviours of the same hierarchy level are 
merged. 

The task of collecting cans can be decomposed as follows. The basic behaviour 
of the highest hierarchy level is Safe_Velocity_Ctrl which reduces the wheel veloc- 
ity if the distance readings of the front sonars get below a certain level. There 
are two other basic behaviours. Navigation and Gripper_Ctrl. The former will turn 
the robot towards a can to be collected or the collection point where the can is 
to be dropped, while avoiding to bump into walls. The latter will stop the robot 
and open or close the gripper at the right time. In the section on experiments, 
we describe how the Navigation behaviour can be learnt. 

5.3 Localisation and Position Tracking 

Most existing localisation methods for mobile robots make simplifying assump- 
tions about the properties of the sensors. These methods therefore work well only 
when the inherent assumptions hold for the particular robot, its behaviour, and 
its environment. Many methods, for example, assume a large number of evenly 
spaced sensors, which render them useless in robots with very few sensors. 

In comparison to mobile robots that have a ring of sonars, the sensing ca- 
pabilities of the Pioneer 1 are rather limited. At first, we tried to use a Markov 
localisation method jij. However, this approach failed. The robot became lost 
when the sonar sensor readings were sparse and noisy, for example, when the 
robot was moving diagonally through a corridor. In this situation, the walls of 
the corridor reflect the sonar beams and hardly any distance readings from the 
front sonars are correct. 
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In further investigations, we developed a localisation method jll that works 
reliably in our setup. The approach consists of three steps. First, we compute a 
two-dimensional feature space by applying a straight-line Hough transform 
to the sonar readings. Second, we perform template matching in the feature 
space by using the world map as reference pattern. Third, we use the correlation 
counts obtained in the previous step to update a position probability grid. 

This method, which is to our knowledge the only sonar-based position track- 
ing system for the Pioneer 1 to date, is less dependent on individual sonar sensor 
readings than a traditional Markov localisation approach. Rather than using the 
product of the likelihoods, p{s\L), the likelihood of observing the sensor read- 
ing s at the position L, we use the correlation count (a combined feature) for 
the computation of position probabilities. In addition, we use the detection of 
wall-like segments in the sonar data to detect situations in which the the current 
sensor information is insufficient for localisation. In those situation, the sensor 
information is not used to update the position probabilities. The method inte- 
grates a feature-based detection method with a dense-sensor matching technique 
by using the Hough transform for feature detection and a grid-based approach 
to update a distribution of position probabilities. 

6 Experiments on Continual Learning 

We have applied the RL algorithm of Section to a mobile robot learning a 
navigation task in different environments. The control architecture of the robot 
is described in the previous section. We have chosen a scenario in which the 
robot’s task is to reach a sequence of goal positions, which have been chosen 
at random. The learner receives reward when it finds the goal and punishment 
when it bumps into a wall. The algorithm should be evaluated with respect to 
its ability to segment the agent-internal state space and to adapt state space 
and policy, for example, when some features of the environment change. 



6.1 Learning Scenario 

The Pioneer 1 robot was operated in a laboratory environment. As shown in 
Figure [5l the robot arena consists of straight wall segments. For the robot, it is 
difficult to navigate due to the existence of many corners and edges. 

The input to the learning system, corresponding to the agent-internal belief 
state b, at each time step is the vector (x,y,0,a), where (x,y,9) denotes the 
estimated position and orientation of the robot in the environment, which are 
provided by the position tracking system [^, and — tt < a < tt is the angle to the 
goal with respect to the robot’s current orientation. Values a < 0 denote that 
the goal is on the left-hand side and a > 0 on the right-hand side of the robot. 

The output of the learning system are the three possible actions: turn left, 
move straight-on, and turn right. Each command is executed by the robot for 
a period of 0.1 seconds. It may take the robot up to 10 seconds to reach the 
goal position or to bump into a wall. The robot is said to reach the goal if it 
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Fig. 5. The Pioneer 1 mobile robot and its test environment. The size of the robot 
arena is 5.2 m x 4.4 m. 



comes within a radius of 40 cm to that position. In this case, the learner receives a 
reinforcement of r = 1. If the robot bumps into a wall, the performance feedback 
is r = 0, independent of the distance to the goal at the time of the collision. A 
learning trial is finished if the robot reaches the goal or bumps into a wall. In 
either case, a new goal position is chosen at random. 

The navigation behaviour which is learnt from reinforcement is later used for 
the can-collection task, in which the the angle a corresponds to the visual angle 
the object to be picked up is in, that is provided by the vision system. 

Learning a suitable policy for the navigation task is difficult. First, the inter- 
action with the environment is stochastic. The position information is subject to 
an error of up to 30 cm in a;j/-direction and 10 degree in 0-direction in the worst 
case. Given the same input to the learning system, in one situation there might 
be enough space for a full turn in front of an obstacle and in another situation 
there might not. That is, the performance feedback i? is drawn from a probabil- 
ity distribution. Second, the feedback is extremely delayed, considering that an 
action sequence can consists of up to 100 steps. To speed up the learning pro- 
cess, we use a reinforcement program that acts as an demonstrator in the early 
phase of the learning process. In fact, this is a hand- written controller, turning 
the robot to the right if the goal is on the right-hand side, turning it to the left 
if the goal is on the left, and moving it straight-on otherwise. Obviously, this 
behaviour is sub-optimal in the presence of edges in the environment. However, 
it makes sure that the learner receives reward from time to time in the early 
phase of the learning. 



6.2 Experimental Evaluation 

The agent-internal state space in the navigation task has fc = 4 dimensions. Since 
the visualisation of such a space is difficult, we have used dynamic quantisation 
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(a) Initial network. 



(b) Network in the early phase 
of learning. 
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(c) Final network for first envi- 
ronment. 



(d) Final network for second 
environment. 



Fig. 6. Dynamic cell structures are used to segment the {x, j/)-plane for the robot 
navigation task. In (c) and (d), the edges between the nodes are omitted. 



only in the ccy-plane. The segmentation for 9 and a was done manually. We have 
used 6 intervals in 9 and 4 intervals in a direction. See Figure El for an example. 
The learner starts with a small number of nodes - see Figure 6(a) (for k = 2 
input dimensions, this is 3 nodes corresponding to 3 agent-internal states) - and 
then divides the state space further by adding nodes while performing dynamic 
programming. If the learning task does not change, we are going to reach a 
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situation in which adding nodes does not help in predicting reward. That is, 
the network does not change anymore, see Figure 6(c)| After this task has been 
learnt, the environment is extended, which requires a modification of the state- 
space and action policy. Taking network (c) as initial network, the result of the 
adaptation is shown in Figure [6(d)] 

The experiments performed so far on learning the task from scratch indicate 
a speed-up of about 20 percent due to better generalisation in comparison to 
learning the same task using a predefined state space (160 xy-tiles of 30 cm 
length, 6 0-intervals, 3 a-intervals) . The corresponding neural network consists 
of about 25 nodes in the a;y-plane. 

The speed-up in multi-task learning that can be achieved depends on the rein- 
forcement learning method used. There are two classes of RL algorithms: model- 
free algorithms which learn a policy without learning a model, e.g., Q-learning 
and the TD(A) algorithm, and model-based algorithms which learn such a model 
and use it to find an optimal policy, e.g., prioritised sweeping and real-time 
dynamic programming (iQj. A continual-learning agent has typically to learn 
simple tasks first before progressing to more complex ones. This approach of 
‘shaping’ requires constant modifications of the reward landscape, which means 
that an agent implemented with Q-learning will be re-learning most or all of 
its Q-values again and again. With respect to this changing reward problem, 
model-based methods have an advantage because the agent could transfer its 
state-transition mapping to a similar domain or task easily. We are currently 
performing a detailed performance evaluation of model-based (value iteration, 
prioritised sweeping) and model-free learning paradigms (Q(A)-learning) in con- 
nection with the use of dynamic cell structures as representational tools. 



7 Conclusions 

We have argued that there is a close connection between adaptive state-space 
quantisation and continual learning. This was the motivation for developing a 
novel method for learning multiple tasks from reinforcement. This technique, 
which has been applied to a mobile robot, performs a quantisation of the agent- 
internal state space using constructive neural networks. Currently, we are in the 
process of evaluating the effectiveness of the method for multi-task learning. 
Preliminary experiments have shown that the method can be used to control 
a Pioneer 1 mobile robot collecting soda cans. The method enables the robot 
learner to re-use the policy and model of the environment learnt so far when the 
environment changes. 

The experimental results indicate that the utile distinction test can be applied 
successfully to find distinctions in a continuous sensory space if unsupervised- 
learning constructive neural networks are used as representational tools. To deal 
with the uncertainty in the robot’s sensations and actions, we have used a sonar 
sensor-based position tracking system as state estimator. Learning the navigation 
task from the raw sonar sensor readings only would not have been possible. 
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Abstract. In the field of evolutionary robotics artificial neural networks 
are often used to construct controllers for autonomous agents, because 
they have useful properties such as the ability to generalize or to be 
noise-tolerant. Since the process to evolve such controllers in the real- 
world is very time-consuming, one usually uses simulators to speed up 
the evolutionary process. By doing so a new problem arises: The con- 
trollers evolved in the simulator show not the same fitness as those in the 
real-world. A gap between the simulated and real environments exists. In 
order to alleviate this problem we introduce the concept of neuromodu- 
lators, which allows to evolve neural networks which can adjust not only 
the synaptic weights, but also the structure of the neural network by 
blocking and/or activating synapses or neurons. We apply this concept 
to a peg-pushing problem for Khepera"'^^ and compare our method to 
a conventional one, which evolves directly the synaptic weights. Simula- 
tion and real experimental results show that the proposed approach is 
highly promising. 



1 Introduction 

The Evolutionary Robotics(ER) approach has been attracting a lot of concern 
in the field of robotics and artificial life. In contrast to conventional approaches 
where designers have to construct robot controllers in a top-down manner, the 
methods in the ER approach can automatically construct controllers by taking 
embodiment{e.g. physical size of robot, sensor/motor properties and disposition, 
etc.) and the interaction between the robot and its environment into account. 



J. Wyatt and J. Demiris (Eds.): EWLR 1999, LNAI 1812, pp. 44 4601 2000. 
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In the ER approaches, artificial neural networks are widely used to construct 
controllers for autonomous mobile agents, because they can generalize, are non- 
linear and noise-tolerant | Floreano94INolfi97IBeer89|Ackley92| . Another advan- 
tage of neural network-driven robots is that a neural network is low-level de- 
scription of a controller. More precisely, it directly maps sensor readings onto 
motor outputs. Although the ER approach has the above advantages, the fol- 
lowing drawback still exist: 

As the evolution in the real world is time-consuming, simulations are used 
to evolve the controller in a simulated environment and the best individuals are 
tested in the real world. The flaw of this combined approach is that evolved 
agents in simulated environments show often a significantly different behavior in 
the real world due to unforeseen perturbations. In other words, a gap between 
the simulated and real environments exists. 

Therefore, evolved controllers should adapt not only to specific environments, 
but should be robust against environmental perturbations. Conventionally, to re- 
alize this requirement, many authors have been using techniques such as fitness- 
averaging |Reynolds94| and/or adding noise |,Tacobi95|Miglino^ in the evolu- 
tionary process. 

However, we speculate that these passive approaches are not essential solu- 
tions for the above requirements. We envision that without establishing a grace- 
ful method to realize robust controllers, we can not seamlessly transfer evolved 
agents from a simulator to the real world. 

This leaves the following question. How can robots recognize their current 
situation and regulate their behavior appropriately? 

A part of the answer may be that in the studies so fare made in ER no 
attempt was made to select directly for adaptation by changing the settings of 
the experiments. As even a simple thermostat needs sensory feedback to be able 
to control the temperature, an essential ingredient to any adaptive controller are 
controlling sensors to give the neural controller data how to change its current 
state towards the “good” one. 

To construct robust controllers against environmental changes, in this study 
we focus on creation of feedback loops and their regulation mechanisms as the 
target to be evolved instead of evolving the synaptic weights (see Figure 1). If 
we can successfully evolve the appropriate regulation mechanism, we can expect 
high robustness against environmental perturbations. 

In principle the information carried by the feedback loops can have the follow- 
ing two effects: Either one changes the weights of the synapses and the neurons’ 
thresholds or one alters dynamically the structure of the neural network itself. 
The question is how can this be done and can such methods be used to solve the 
above problem. 

Interestingly, neuroscientific results suggest that biological networks not only 
adjust the synaptic weights, but also the neural structure by blocking or acti- 
vating synapses or neurons by the use of signaling molecules, so called neuro- 
modulators |Meyrand91| . These findings stem from investigations made with the 
lobster’s stomatogastric nervous system in which certain active neurons diffuse 
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neuromodulators which then rearrange the networks. Note that the effect of a 
neuromodulator depends not only on theses substances, but also on the specific 
receptors, which are differently expressed in different cells. Imagine two cells A 
and B which expresses two different receptors C and D . A neuromodulator N will 
then only influence the cell(or synapse) which expresses the receptor A and not 
the cell with receptor B. The effect on a cell depends therefore on the interaction 
between the neuromodulator and the receptor and not just the neuromodulator 
alone. 

The release of the neuromodulators depends on the activity of the neurons 
and therefore different sensor inputs may cause different patterns of released 
neuromodulators. As such dynamic mechanisms yield remarkable adaptation in 
living organisms, the proposed approach not only carries promise for a better 
understanding of adaptive networks, but they can be also applied to real-world 
problems as we already showed in previous work [Ishiguro99lKondo99| . 

In this study, we apply the proposed concept to a peg-pushing problem for 
Khepera^^ and compare our approach to a conventional one, which evolves 
directly the synaptic weights. As there exists no theory about how such dynamic 
neural network can be constructed, the evolutionary approach is the method 
of choice to explore the interactions between the neuromodulators, receptors, 
synapses and neurons. 



Feedback loops 




Creation and regulation of 
the feedback loops are the 
targets to be evolved 



Fig. 1. Feedback loops between the robot and its environment. 



2 Related Work 

Although other authors tried to overcome the gap between simulations and 
real-world environments by fitness-averaging [Reynolds94| and/or adding noise 
j,Tacobi95|Miglino95| , these methods are quite passive and do not allow to tune 
the behavior specifically using differences in the sensory input. 
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Floreano et al. evolved neural controllers by assigning learning rule to ev- 
ery synapse | |Floreano96| . In contrast, our approach allows to evolve learning 
rules, which can correlate not only the activity of neighboring neurons and their 
synapses, but can correlate neuronal activity from distant cells with synapses 
depending on which neuromodulators are diffused in a given situation. Further- 
more, by introducing a blocking function, the neuromodulators can also dynam- 
ically rearrange the structure of the neural network. 

Husbands et al. introduced a gas model (nitric oxide), which allows to modu- 
late synapses |Husbands98| . The main difference to our approach is that we use 
specific receptors to locate an effect on a neuron or a synapse. Using the receptor 
concept a cell can diffuse the neuromodulator to all cells, but only those, which 
have the corresponding receptor will be changed by the neuromodulator, all the 
others rest unchanged. 



3 Lessons to Be Learned from the Lobsters’ 
Stomatogastric Nervous System 

3.1 Dynamic Rearrangement 

Investigations carried out on the lobsters’ stomatogastric nervous system suggest 
that biological nervous systems are able to dynamically change their structure 
as well as their synaptic weights |Meyrand91| . 

This stomatogastric nervous system mainly consists of an oesophageal, a py- 
loric, and a gastric network. Normally, these three individual networks show 
their own independent oscillatory behaviors, but in the moment a lobster is eat- 
ing the networks are integrated and reconstructed to a new one, the swallowing 
network, in which certain neurons and connections are excluded and formerly 
inactive connections are activated (see Figure |2). 




Fig. 2. Dynamic rearrangement of a lobster’s stomatogastric nervous system. 



Recent studies in neurophysiology showed neuromodulators(hereafter: NMs) 
play a crucial role to regulate this remarkable phenomenon(e.g. changing prop- 
erties of synapses as well as neurons). 
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3.2 Neuromodulators 



NMs are substances that can dynamically influence several properties of neu- 
rons and therefore the function of a neural network. In contrast to neuro- 
transmitters(NTs) the effect of NMs spreads slower and lasts longer. NMs change 
the processing characteristics of neural networks by affecting the membrane po- 
tential, the rate of changing the synapses(i.e. influence on learning mechanisms) 
and other parameters. Typical NMs are acetylcholine, norepinephrine, serotonin, 
dopamin{all are also used as NTs), somatostatine and cholecystokinine(hoth also 
used as hormones in the human body) and many small proteins. Although these 
substances are released in a less local manner than NTs, the effects can be quite 
specific. This specificity comes from specific receptors on the neurons and their 
synapses. 

These NMs stem either locally from the neural network itself or from spe- 
cific sub-cortical nuclei. The local release of NMs depends on the activity of the 
local neural network itself. On the other hand, sub-cortical nuclei as the locus 
coeruleus {noradrenergic innervation), the ventral tegmental &ve&{dopaminergic 
innervation) or the basal fore-brain miclei{cholinergic innervation) send neu- 
romodulatory axons to cortical structures to release NMs from axonal vari- 
cosities which is called volume transmission. Many publications in neuroscience 
show the importance of NMs for dynamic rearrangement of neuronal modules 
|Meyrand9HHooper89 or for learning and memory (switching between learning 
a.nd reca.II model [Ha,sse1mo95] . 

In this study we implemented the following properties: 



— dynamic change of the threshold of a neuron 

— dynamic blocking of neurons and synapses 

— dynamic change of the inhibitory or excitatory properties of a synapse 

— dynamic modulation of the synaptic weights (i.e. learning). 



4 Proposed Method 

4.1 Basic Concept 

The basic concept of our proposed method is schematically depicted in Figure 
|3] As in the figure, unlike the conventional neural networks, we assume that 
each neuron can potentially diffuse its specific (i.e. genetically-determined) NMs 
according to its activity, and each synapse has receptors for the diffused NMs. We 
also assume that each synapse independently interprets the received NMs, and 
changes its properties(e.g. synaptic weight). By selecting for regulatory feedback 
loops(cyclical interaction between the diffusion and reaction of NMs), we expect 
to be able to evolve adaptive neural networks, which show a seamless transfer 
from simulations to the real world(in the figure, the thick and thin lines denote 
the connections being strengthened and weakened by NMs, respectively). 

In summary, in contrast to the conventional ER approach that evolves synap- 
tic weights and neuron’s bias of neuro-controllers, in this approach we evolve 
the following mechanisms: 
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diffused NMs 



from s 




dynamic rearrangement 



to actuators 



emerged neural network 



Fig. 3. Basic concept. 



— Diffusion of NMs(when, which type of NMs are diffused from each neuron?) 

— Reaction to NMs(how do the receptors on each synapse interpret the received 
NMs?) 

To determine the above parameters, we use a Genetic Algorithm(GA). 

4.2 Application Problem 

In this study, we use a peg-pushing problem as a practical example. Figure 
schematically shows this problem. As in the figure, in this environment there 
are three objects: the light source, peg and robot. Here the task of the robot 
is to push the peg toward the light source. Due to the rudimentary stage of 
investigation, the following are not taken into account: 

— Gollision to the surrounded walls (in such a case, the trial is terminated). 

— The robot contacts with the peg at the initial position. 

Figure El illustrates the structure of the robot used in this experiment. This 
robot is a simulated Khepera^^ robot, equipped with six infra-red sensors in 



0 Light source 




Robot 



Fig. 4. A peg-pushing problem. 
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Fig. 5. Structure of the robot. 



the frontal side, and light direction detector, and two DC motors. In this study, 
the following equation is used to simulate the property of the infra-red sen- 
sor |Hoshino94| : 



Sensor jvalue = , — , (1) 

1 + exp(0.65(i — 14cos(1.3of)) 

where d represents the distance between the sensor and peg, and a denotes the 
angle between the direction of the sensor heading and the detected object. 

On the other hand, the light direction detector can perceive the direction of 
the light source in three ranges as (see Figure [^: 



Light-Left = •! 
Light-Center = 
Light-Right = 



1.0 18° < 6» < 0° 

0.0 • • • otherwise 

f 1.0 9° < 6» < 9° 

( 0.0 • • • otherwise 

( 1.0 - ••0° < 6» < 18° 

( 0.0 • • • otherwise. 



( 2 ) 



4.3 Controller 

Figure El shows the neural network structure for the robot controller. As in the 
figure, this controller consists of nine sensory neurons(i.e. six infra-red, and 
three light direction sensors), five hidden neurons and four motor neurons(i.e. 
CW/CCW motor neurons for the right and left motors). 
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Sensory neuron 



IR Sensor (L90) 
IR Sensor (L45) 
IR Sensor (LIO) 
IR Sensor (RIO) 
IR Sensor (R45) 
IR Sensor (R90) 
Light Sensor (L) 
Light Sensor (C) 
Light Sensor (R) 
Bias 




Fig. 6. Structure of the controller. 



4.4 Diffusion of NMs 

Figure 0 schematically illustrates the diffusion process. As in the figure, each 
neuron can diffuse its specific(i.e. genetically-determined) type of NMs in the 
case where its activation value is within the specific threshold values(these are 
also genetically-determined) . 

Here, for simplicity, we assume that each neuron can diffuse at most one type 
of NMs. Furthermore, we set the number of NM types to two. 



Sensory neuron 
IR Sensor (L90) — 

IR Sensor (L45) — 

IR Sensor (LIO) — 

IR Sensor (RIO) -►(j^ 

IR Sensor (R45) — 

IR Sensor (R90) 

Light Sensor (L) — ^ ^ 
Light Sensor (C) — 

Light Sensor (R) — 

^ n n 



Hidden neuron 




Motor neuron 

Right motor (CW) 



NM type: NM2 
Lower threshol d 
I Upper threshold 






lotor (CCW) 
)lor (CW) 
>tor (CCW) 



< 



1.0 activation 



NM diffusion area 



Fig. 7 . Diffusion of the NMs. 



4.5 Reaction to the Diffused NMs 

As in Figure [B1 each synapse and neuron has receptors for the diffused NMs. 
Once the NMs are diffused from certain neurons, these NMs may change the 
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Sensory neuron 



IR Sensor (L90) 
IR Sensor (L45) 
IR Sensor (LIO) 
IR Sensor (RIO) 
IR Sensor (R45) 
IR Sensor (R90) 
Light Sensor (L) 
Light Sensor (C) 
Light Sensor (R) 
Bias 




Fig. 8. Reaction to the diffused NMs. 



property of each synapse and neuron. This change is carried out according to 
the genetically-determined interpretation. Detailed explanation is given below. 



Dynamic change of synaptic property. Table [Urepresents an example of the 
receptor interpretation for the change of synaptic property. Due to the number of 
NM types(i.e. two), all possible combination of the received NMs is four, in this 
case. Each case can take one of the following four types of modulation; Hehhian 
learning {Rij{N M)=1.0), anti-Hehhian learning {Rij{NM)=-1.0), non-learning 
{Rij{NM)=0.0), and bloekingii.e. excluding the synapse), respectively. 

Here, we use the following equation for the dynamic modulation of the synap- 
tic weights. 

= wlj + rj ■ Rij{NM) ■ ai ■ Uj, ( 3 ) 

where, rj is the leaning rate, and Rij(NM) is the parameter which determines 
how the synapse concerned interprets its received NMs and modifies its weight 
(see above). 



Table 1. An example of the NM-interpretation table for the change of synaptic prop- 
erty. 



NMi 


NM2 


Type of modulation 


0* 


0 


Hebbian learning 


0 


1 


non-learning 




0 


blocking 


1 


1 


anti-Hebbian learning 



* 0 means NMn is not diffused. 
** 1 means NM„ is diffused. 
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Table 2. An example of the NM-interpretation table for the change of neuronal prop- 
erty. 



NMi 


NM2 


Type of modulation 


0* 


0 


active 


0 


1 


inactive 




0 


inactive 


1 


1 


active 



* 0 means NM^ is not diffused. 
** 1 means NM„ is diffused. 



Dynamic change of neuronal property. Table [2] depicts an example of the 
receptor interpretation for the change of neuronal property. Each case can take 
one of two state; active and inactive. This assignment is determined through the 
evolutionary process. 

In the following, we show simulated and experimental results, and compare 
the proposed method to a conventional one, which directly evolves the synap- 
tic weights. Due to the rudimentary stage of investigation, blocking/activating 
synapses and neurons, and receptors on neurons are not implemented here. In- 
corporating these mechanisms is currently proceeding. 

5 Simulation Results 

5.1 Encoding 

As illustrated in figured the genetic information has two main parts: a diffusion 
part and a reaction part (the genetic information is encdoded in a binary string). 
The diffusion part encodes the types of neuromodulators a neuron can release 
and when(determined by two thresholds) a neuromodulator will be released. The 
diffusion part contains 18 blocks of information for the 18 neurons used in our 
neural network. The thresholds are encoded as 8 bits whereas for the NM type 
2 bits are used. 

The reaction part determines the receptor types(up to two) of a neuron or 
synapse and which function(block, learning and non-learning) will be applied in 
the case one or two receptors become active. As we use for the moment just two 
receptors in our model, we need two bits to encode the information contained in 
Table [H 



5.2 Evaluation Criterion and Conditions 



In order to verify the validity of our proposed method, in the following we com- 
pare it with the conventional ER approach where the synaptic weights are the 
target to be evolved. The evaluation criterion used for the evolutionary process 
is: 



fitness 




dist(light,pegend) ^ 
dist{light, peg start) j 



(4) 
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Diffusion part Reaction part 



12 I 18 


12 j 74 1 




























0^ 6y NM type 



0^ : lower threshold 
6[j : upper threshold 
NM type: NMl/NM2/none 




TL(0, 0) 


TL(0, 1) 


TL(1, 0) 


TL(1, 1) 



TL(0, 0) : type of learning in the case of NM(0, 0) 
TL(0, 1) : type of learning in the case of NM(0, 1) 
TL(1, 0) : type of learning in the case of NM(1, 0) 
TL(1, 1) : type of learning in the case of NM(1, 1) 

type of learning : Hehbian/anti-Hebhian/none 



Fig. 9. Encoding scheme for the dynamically-rearranging neural network. 



Table 3. Parameters used in the simulation. 



Learning rate (g) 


0.1 


Population 


100 


Generation 


200 



where, dist{light^ peg start) is the distance between the light source and the ini- 
tial peg position. Similarly, dist(light,pegend) represents the resultant distance 
between the light source and the resultant peg position. 

Each individual is allowed to move during 500 time-steps in the environ- 
ment shown in Figured Furthermore, each individual is tested ten times from 
the different random heading direction. Thus, the averaged fitness value of each 
individual is used for the evaluation. The parameters used in the following sim- 
ulations are listed in Table |3] 

During the evolutionary process, tournament selection scheme, uniform 
crossover and random mutation are adopted. 

5.3 Results 

Figure [m] shows the resultant fitness transition of both methods. As in the figure, 
both methods finally obtain high scores as the evolutionary process iterated. 
Figure [TT] depicts typical trajectories of the best agents obtained in the final 
generation. From the figures, it is understood that both agents can successfully 
carry the peg toward the goal. 
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Fig. 10. Comparison of the fitness transition. 



Conventional method Proposed method 

Fig. 11. Resultant trajectories of the best agents in both methods. 



5.4 Evaluation of Adaptability 

As mentioned earlier, the evolved agent should be not only adapted to the given 
environments during the evolutionary process, but also be robust against envi- 
ronmental perturbations. Without this we can not seamlessly tranfer the evolved 
agents from the simulated to the real worlds. In the following, we show the quan- 
titative comparison of adaptability of both methods. 



Robustness against the specific environmental perturbations. In order 
to verify the robustness, we observed the behavior of the evolved agents in the 
following test environments: 

Table m shows the results of these tests. Each result is represented by the fit- 
ness value averaged over one hundred trials. From these results, it is understood 
that the proposed method yields remarkable high scores while the conventional 
method shows serious degradations. 
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Testl: Add 10 % slip to the right motor output. 
Test2: Add 10 % slip to the left motor output. 
Test3: Add 0.2 degree to the direction of 
the resultant peg movement. 

Test4: Add -0.2 degree to the direction of 
the resultant peg movement. 



Table 4. Quantitative comparison of the robustness against the specific environmental 
pertnrbation. 





Conventional 

method 


Proposed 

method 


Withont Perturbation 


83.0 


96.1 


Test 1 


66.6 


92.6 


Test 2 


70.3 


89.8 


Test 3 


71.5 


79.6 


Test 4 


77.7 


80.1 



Robustness against the alteration of the peg size. Figure [12] shows the 
result of the comparison of the robustness against the alteration of the peg size. 
In the figure, horizontal axis represents the magnification of the peg size(e.g. 
0.5 means a half size of the original peg). From this result, we can see that the 
proposed method can keep high fitness value over these alterations compared 
with the conventional method. 




Fig. 12. Quantitative comparison of the robustness against the alteration of the peg 
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Table 5. Quantitative comparison of the robustness against the random noise. 





Conventional 

method 


Proposed 

method 


Without Noise 


83.0 


96.1 


With Noise 


75.1 


92.2 



Robustness against random noise. We furthermore observed the resultant 
behaviors of both methods by adding 5% and 3% random noises based on the 
Gaussian distribution onto each motor output and direction of pushed peg move- 
ment, respectively. 

Table |5]shows the results of this test averaged over one hundred trials. From 
the table the proposed shows high adaptability. Figure[T3]is a typical example of 
the resultant trajectories of both methods. The robot driven by the dynamically- 
rearranging neural network can successfully reach the goal. Figure [14| is an ex- 



o 




Conventional method 




Proposed method 



Fig. 13. Typical resultant trajectories. 




Fig. 14. An example of the transition of the synaptic weights. 
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ample of the transition of the synaptic weights. In the figure, for convenience 
several synapses are selected. Interestingly, some synapses drastically change 
their weights from excitatory to inhibitory and vice versa, and others settle to 
certain values. From this result, we speculate that the neural networks suitable 
for the current situation are emerged by appropriately changing the correlation 
among neurons. Detailed analysis is currently under investigation. 

6 Experimental Results 

As mentioned in the previous section, the proposed dynamically-rearranging 
neural network shows high adaptability against the environmental perturbations 
compared with the conventional ER approach where the synaptic weights are 
the target to be evolved. This implies the proposed method would be highly 
promising to reduce the serious gap between simulated and real environments. 

In the following, we investigate how the evolved controllers in both methods 
work after tranferring onto the real mobile robot. Figure [15] and [16| show the 
real robot and its experimental setup, respectively. Figure [17] depicts the exper- 
imental results of the evolved agents which are identical to the ones in Figure 
10. Unlike in the simulated environments, the agent based on the conventional 
method cannot successfully carry the peg. On the other hand, the agent of the 
proposed method show the same function as in the simulation. 

7 Conclusion 

In this paper, we proposed a new adaptive controller for autonomous mobile 
robots by incorporating the concept of dynamic rearrangement of neural net- 
works with the use of neuromodulators. 

We investigated the adaptability of the proposed method in comparison with 
the conventional ER approach by carrying out simulations. The obtained results 
are encouraging. Detailed analysis of the obtained neural networks(e.g. how do 
the NMs change the correlation among neurons?) are currently under investiga- 
tion. 





Light det^or 


Infrared sensors 


!T \ 

Wheel 



Fig. 15. Experimental robot {KheperaF^). 
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Fig. 16. Experimental setup. 




Conventional method Proposed method 



Fig. 17. Experimental results. 



We are also concurrently investigating the implementation of the proposed 
dynamically-rearranging neural networks by FPGAs(Field Programmable Gate 
Arrays). Full implementation of the evolved dynamically-rearranging neural net- 
works is currently proceeding. 
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Abstract. Finding redundancy with respect to a given task in an en- 
vironment is important ability for a robot to dynamically assign the 
degrees of freedom to several tasks. In this paper, an adaptive controller 
is proposed that does not only estimate appropriate control parameters 
but also can find redundancy with respect to the given task dynamically. 
The controller is derived based on a least-mean-square method. A sim- 
ulation result of a camera-manipulator system is shown to demonstrate 
that the proposed method can find redundancy automatically. 



1 lutroductiou 

General robots nowadays, such as industrial robots, have minimum number of 
degrees of freedom and minimum variety of sensors only sufficient for presup- 
posed situations. Such robots can achieve restricted tasks properly in restricted 
environments. However, they cannot deal with unexpected situations appropri- 
ately since they are equipped with only necessary and sufficient numbers of 
actuators and sensors. 

A robot, as a universal machine, ought to have adaptivity, ability to esti- 
mate appropriate control parameters and/or structure to achieve a given task 
in an environment. So as to have such adaptivity against changes of task and 
environment, a robot needs to have larger number of actuators and more variety 
of sensors. Such redundancy even can allow us to design a robot so that new 
behaviors can emerge [T]. In this sense, we have to focus on a multi-DOF robot 
that has many actuators and a variety of sensors. 

In figure (2a) , a general control architecture for a less-DOF robot is shown. 
Sensor data is compared with a given task and is fed into the controller. Then, 
the controller calculates commands for actuators. Just by extending this way, 
we can derive a controller for a multi-DOF robot (figure [T](b)) although this can 
only achieve a single task. The robot may have more ability if it can achieve 
several tasks at a time. In addition, the robot will be more adaptive if it can 
dynamically assign its degrees of freedom to each task since this means the robot 
can estimate not only appropriate parameters but also an appropriate control 
structure(figure (TJc)). To assign degrees of freedom dynamically, it is important 

* This work is partly done during his stay in the University of Ziirich. 
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task 



"Q- 

n 



controller 




robot 



actuators 



(a) A control architecture for a robot with several degrees of freedom 



environment 



sensors 




(b) A control architecture for a robot with more degrees of freedom although it cannot 
find redundancy automatically 




(c) A control architecture for a robot with more degrees of freedom with ability to find 
redundancy and to spare it for another task 



Fig. 1. A control architecture for a less-DOF robot (a) is extended for a multi-DOF 
robot although it can achieve only one task (b). If the controller can assign DOF to 
the given task dynamically, it can spare the remainder for another task. 
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for the controller to be able to find redundancy with respect to the task on-line 
since necessary number of degrees for a task depends on the environment where 
it is given. Even in case of malfunction of several actuators, the robot can assign 
unbroken actuators to the task, and the task will be achieved. 

There has been a certain amount of work trying to make an adaptive robot. 
In the field of control theory, they have developed adaptive control schemes 
which can estimate parameters but cannot find redundancy 0. There are 
also several attempts to make an adaptive robot by a control architecture such 
as a subsumption architecture 0 and a schema system |^, each of which also 
cannot utilize redundancy of the robot effectively. 

Our final goal is to build an autonomous adaptive robot that has many 
degrees of freedom and sensors and that can utilize redundancy to adapt to a 
dynamic environment. We have proposed to use a hybrid structure of adaptive 
controllers for multi-DOF robots In these papers, each controller can 

estimate parameters, but cannot find redundancy with respect to a given task, 
and therefore cannot utilize it. 

In this paper, an adaptive controller is proposed which estimates not only 
parameters but also sufficient degrees of freedom for a task. The remainder of 
this article is organized as follows. First, the adaptive controller is introduced 
based on a least-mean-square method. Then, by utilizing UD factorization of 
the covariance matrix, a method to derive the redundancy is proposed. Finally, 
a simulation result is shown to demonstrate the effectiveness of the proposed 
method. 

2 Adaptive Controller That Can Find Redundancy 

2.1 Task Definition and Feedback Controller 

The i-th task for the robot is assumed to be defined in the f-th sensor space: 
Xi — >■ Xid- The class of tasks is limited by this definition, but still a certain 
amount of tasks can be defined in this way: for example, to track a moving 
visual target [S], to keep balance of a legged robot [7], to exert force towards 
constraint surface |5], to maximize manipulability m, and so on. 

The sensor output Xi is a function of actuator displacement 9, 

x, = x,{9). (1) 

Differentiating eq. HD with respect to the sampling rate, we can get 

Ax,{k) = J,{k)A9{k), (2) 

in the k -th sampling step, where Ji is a Jacobian matrix (= dxijdO"^). Assume 
that all the actuators of the robot are velocity-controlled, and we can derive a 
velocity command for the i-th task from eq.(I5J, based on well-known resolved 
rate control m 

u{k) = J,{k)'^K,{xu{k) - x,{k)) + (I - J,(A:)+ J,(A:))|, 



(3) 
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where I, Ki and ^ are an identity matrix, a feedback gain matrix, and an arbi- 
trary vector that denotes the redundancy of the controller with respect to the 
given task to follow Xid- A'*’ denotes a pseudo-inverse of a matrix A. By utilizing 
the last term in the right hand side of eq. we can make a hybrid structure 
of controllers |^. If the robot and the environment are well-calibrated and do 
not change dynamically, the Jacobian matrix J i[k) is known, and therefore, the 
resolved rate controller © can achieve the task. In case they are not calibrated, 
or the environment changes dynamically, the matrix J i{k) is unknown, and we 
have to estimate it. 



2.2 On-Line Estimation of Jacobian Matrix 



To estimate the matrix J i{k) the authors have proposed a method based on the 
least-mean-square method [H]: 



J,(fc) = J,{k-l) + {Ax,{k)-J,{k-l)Ae{k)} 



Ae{kf P{k - 1) 
p + A6{kf P{k - l)A6{ky 



( 4 ) 



where (0 < p < 1) is a forgetting factor, and P{k) is a covariance matrix 
calculated as 



P{k) 



1 [p(k - 1 ) - P{k-i)Ae{k)e{kfp{k-i) \ 
p+ A9{kf P{k-l)A9{k) ) 



( 5 ) 



We have already demonstrated that many kinds of robots can be controlled 
utilizing the estimator (El) and the feedback controller m without any a priori 
knowledge on the robot nor on the environment |2]. 



2.3 Estimating Redundancy 

Utilizing a characteristic of the estimator it can find the redundancy with re- 
spect to the task. If the robot is redundant to achieve the i-th task, the variation 
of the actuator displacement A9 is not uniform over the task space. Therefore, 
by observing the covariance matrix P which represents the variation of A9 the 
controller can find redundancy. 

It is known if the variation of A9 is not uniform, the covariance matrix P 
is no more positive, and therefore the least-mean-square based estimator will 
lose numerical stability To avoid such instability, a technique so called 
UD factorization is proposed [12j . which we see more details in the appendix. 
Following the method, the covariance matrix P is factorized 

P = UDU^, (6) 

where U is an upper triangular matrix whose diagonal elements are all 1, and 
D is a diagonal matrix. Instead of updating P by eq.(El), we update U and D 
(see appendix) to ensure the positiveness of the matrix P and therefore, we can 
ensure the numerical stability. 
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Note that if the positiveness of P is weak, one of the diagonal elements of D 
is nearly 0. Therefore, if the control input for the robot AO is not uniform, we 
can find it by observing D. Let D and U be 



D — diag [ di 



U = 



1 * 



_0 1 _ 

= [m1 • • • M„] 



(7) 



( 8 ) 



respectively. If the robot is redundant with respect to the task, a diagonal element 
dj of D becomes very small. If |dj| < e (e is a considerably small number), the 
corresponding direction of AO is not anymore effective to the estimation, that 
is, the direction is redundant with respect to the task. We can easily find that 
in eq. every term concerning about AO is multiplied with P. Since the term 
PAO is factorized. 



PAO = UDWAO 



= U 



AO, 



the redundant direction is given as a column vector of L? To eliminate the 
redundancy, we remove this direction from the estimated matrix Ji. 

3 Simulation 

We show a simulation result to demonstrate that the proposed scheme can find 
redundancy with respect to one task, and that the robot can apply another task 
by utilizing the redundancy. 

A robot used for the simulation has 4 DOFs with two cameras depicted in 
figure [2 By two cameras the robot observes two positions of image points in the 
image planes. The first task given to the robot is to keep these positions constant 
in the image planes, that is, a visual servoing task. The points are moving along 
circles on a plane whose diameter is 0.1[m] in 5[s], 10 times. 

Two cameras are gazing at 2 points, therefore the image feature vector is 
a; G 5R®. The robot has 4 DOFs, 6 G Two points are moving along the circle 
on a plane, therefore the DOF needed for the task is 3. 

The forgetting factor of the estimator p = 0.8. The initial matrix of P is an 
identity matrix. The initial matrix for the estimated Jacobian matrix is given as 
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Fig. 2. A 4 DOF manipulator with 2 cameras is used for the simulation (Unit is [m]). 
Each of visual targets(two points) is moving in a circle on a plane. 



an arbitrary matrix 



J(0) = 



100 0 0 0 
0 100 0 0 
0 0 100 0 
0 0 0 100 
100 0 0 0 ’ 
0 100 0 0 
0 0 100 0 
0 0 0 100 



because we assume that the controller does not have any a priori knowledge on 
the parameters of the robot and of the environment. Note that the rank of this 
matrix is 4. This means that all the 4 DOFs are utilized for the visual servoing 
task at the beginning of the control period. 
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To help understanding of the readers, we show the approximate matrix of 
true Jacobian: 



2300 


0 


0 


0 


100 


0 


1300 


1300 


2400 


300 


200 


0 


0 


0 


1300 


1300 


2400 


-300 


-200 


0 


100 


0 


1300 


1300 


2300 


0 


0 


0 


0 


0 


1300 


1300 



Note that the rank of this matrix is 3. 

Here we show two simulation cases: 

case 1 applying visual servoing control without the redundancy estimator, and 
case 2 applying visual servoing control with the redundancy estimator. 

In case 1, the estimator cannot find redundancy with respect to the visual servo- 
ing task whereas it can in case 2. In case 2, utilizing the redundancy, the second 
task to make 9 ^ converge to — tt/3 is applied. It is performed by minimizing the 
index 

g = {03-(-^/3)}2. 

A simulation result is shown in figures E] and U From figure E] we can say 
that the performances in the image planes in both cases are almost the same 
(hardly distinguishable) . In both cases, the visual servoing task is achieved with- 
out knowing about structure and parameters of the robot and the environment 
by utilizing the on-line estimator. In figure El around 3 [s], the proposed estima- 
tor finds the redundancy with respect to the visual servoing task. After that, 63 
is converging to — tt/3 utilizing the redundancy (see the dashed line) whereas the 
controller cannot utilize redundancy in case 1. 

4 Conclusions and Discussion 

In this paper, an adaptive control scheme has been proposed that can estimate 
not only parameters but also sufficient degrees of freedom to achieve a given task. 
It can find redundancy with respect to the task by itself and can spare it for 
other tasks. By combining such controllers, it is easy to construct a hybrid control 
architecture for a robot that has many actuators and sensors. At first, only the 
first task controller is working, then it finds redundancy with respect to the first 
task. It will spare the found redundancy for the second task controller, and so 
forth. Note that the first controller does not simply inhibit the lower controllers, 
but spares the found redundancy for them. As a consequence, a behavior is 
emerged by a combination of several controllers without any supervisor. 

For example, to control a humanoid that has many degrees of freedom and 
sensors, the effort to program all the degrees grows enormous. In addition, the 
environmental constraints are difficult to be predicted. For such a robot, the pro- 
posed method is supposed to be very powerful. One should prepare several tasks 
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Fig. 3. Simulation result 1 : image error in camera 1, image feature 1, x- coordinate 

{Xi - Xid) 




Fig. 4. Simulation result 2 : joint 3 displacement 63 



to be achieved, and apply the proposed scheme to each task. Since each con- 
troller will automatically find redundancy in the relation with the environment 
and spare it for other task, explicit coordination is not needed. 

In the proposed method to construct the hybrid control structure, it is nec- 
essary that each of tasks must be defined by the desired sensor output Xd (or 
Xd{t), a function of time), which obviously limits the applicability of this method. 
However, a certain amount of tasks still can be defined in this way: for example, 
to track a moving visual target [^, to keep balance of a legged robot |7], to exert 
force towards a constraint surface [^, to maximize manipulability m, and so 
on. This means there are a certain amount of behaviors that can be emerged by 
the proposed method. 
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A UD Factorization [1 2] 

If the initial UD factorization of P is given, updated matrices U and D are 
obtained from U and and D: 

f = U^A0, 

Vi = difi {i = 
ai = p-hvifi, 
di = di^jai, 
bi = vi, 



1, - ■ ■ ,n, nis dimension of P) 



70 



K. Hosoda and M. Asada 



As for the j = 2,---,n -th elements, they are updated as follows, where = 
denotes overwite in the program manner, 

dj = djOtj—ijcXjj 





Pj ~ 






U^J 


= Uij + biPj 1 

A, 


■■J)- 


(9) 


h 


= bi + UijVj ) 


, A 
cii — 


di/p 


■ ,n). 


(10) 
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Abstract. This paper investigates a method for collecting examples it- 
eratively in order to refine the results of induction within the framework 
of inductive logic programming (ILP). The method repeatedly applies 
induction from examples collected by using previously induced results. 
This method is effective in a situation where we can only give an inaccu- 
rate teacher. We examined this method by applying it to robot learning, 
which resulted in increasing the refinement of induction. Our method 
resulted in the action rules of a robot being learned from a very rough 
classification of examples. 

Keywords: robot learning, ILP, collection method of examples, inaccu- 
rate teachers. 



1 Introduction 

Learning control rules of mobile robots is an important application in the field of 
machine learning (ML), in which many methods are being investigated. A direct 
application of concept learning to induce control rules of real mobile robots is 
difficult for reasons that include noisy sensor information of robots, and real 
time processing. Another interesting and important difficulty of robot learning 
is that we cannot have an accurate teacher for learning. We can have a model of 
robot behavior for short periods or step-wise behavior, but it is difficult to say 
how behavior affects a robot’s situation after a while. A teacher must indicate 
if a robot’s action is appropriate for a sequence of actions and even for a total 
goal. We cannot obtain this kind of teacher; that is, we cannot give a model 
that classifies an action as correct or incorrect. This paper proposes a learning 
procedure using an inductive logic programming (ILP) framework to learn from 
an inaccurate teacher. 

ILP has been intensively studied because it can be applied to complex and 
structural domains, where the first-order language is necessarily to express ob- 
jects. Instead of the merit, we can find that the ILP is not very good at the 
domains of robot learning. Even for simple tasks, such as following a visible 
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target, which we describe in this paper, it is not easy to acquire appropriate 
rules by ILP method. This is an evidence that ILP method should be studied in 
areas, such as robot learning, which is suffered from the problem of noise and 
uncertainty. We describe learning in such areas as learning from an inaccutate 
teacher, which we deal with using ILP framework. 

For learning from an inaccurate teacher, we introduce a procedure to col- 
lect examples, the procedure which is called an iterative collection of examples 
(ICE) from experiments with robot learning. ICE repeats the routine of example 
collection and induction. Example collection is carried out by using previously 
induced action rules of robots. We will show that action rules induced from ex- 
amples collected by ICE, incrementally refine a robot’s behavior. In the next 
section, we explain the situation of inaccuracy in our problem. Section 3 intro- 
duces the proposed example collection procedure, ICE, and explains its expected 
effect. Section 4 introduces our experimental environment of robot learning and 
our ILP problem in this environment. In Section 5, an application of the ICE 
method to the problem is given, and the results are shown in Section 6. Finally, 
Section 7 presents some concluding remarks and a discussion of related work. 

2 Inaccurate Teachers in Robot Learning 

In supervised learning, which includes an ILP framework, a teacher is neces- 
sary for classifying cases to be true or otherwise. In some fields of application, 
however, it is very difficult to provide a teacher, and we can only prepare an 
inaccurate teacher. Learning for robot control is such a field. It is difficult to 
classify the actions of a robot at each step. 

We define an ideal teacher as an oracle that says yes if and only if a given 
instance of a target relation is appropriate. We can say a teacher is inaccurate if 
the teacher is not ideal; that is, the teacher sometimes says yes for an instance 
even if it is inappropriate, or it says no even if the instance is appropriate. 

In robot learning, we face the problem of inaccurate teachers. The reasons 
why we can only give an inaccurate teacher include: 

1. We can get only time- local information and can give only time-local actions 
at the moment. 

2. A global goal (to reach a goal position) cannot be achieved by an action but 
by a sequence of actions. 

3. It is difficult to plan a sequence of actions either for a goal or for a mass of 
actions, such as avoiding collisions and turning a corner. 

From these reasons, we can only give a plausible model to detect preferable 
actions. A reasonable construction of a teacher for this situation is obtained by 
combining two teachers. 

1. (A chatty but erroneous teacher) A model to say yes for appropriate 
cases and for some inappropriate cases. This can be constructed by weak 
conditions on a state of a robot at each moment. 




Learning Robot Control by Relational Concept Induction 



73 



Input T : a teacher 

N : the number of examples to be collected 
B : background knowledge 

{Eq,Eq) random-collection(T, A^) ; i := 0 
Repeat 

Induce a logic program Pi from [Ef , E~ ) with background knowledge B 
{AEf , AE~) guided-collection(T, Pi, N) 

Ef+^ ■.= E+ U AE+ ■ P- 1 := E- U AE~ ■, i ■.= i + 1 

EndRepeat 

Fig. 1. Induction algorithm using the iterative collection of examples (ICE). 



2. (A taciturn but reliable teacher) A model that says no only for critical 
cases. This can be constructed by watching collisions, for example. 

Normally, we can prepare these teachers, because the former teacher can be 
constructed from a naive understanding of behavior, and the latter can be given 
based on a global goal. 

3 Iterative Collection of Examples 

We propose an algorithm, induction by iterative collection of examples (ICE), 
which is shown in Figured] In the algorithm, two example collection procedures, 
random-collection and guided-collection, are used. They are given in Figure El 
random-collection collects a specified number of positive and negative exam- 
ples from the robot’s behavior, caused by a random sequence of actions. The 
actions sequentially change the robot states. This procedure classifies the re- 
sulting states into positive and negative examples by using a given teacher. On 
the other hand, guided-collection collects examples using an induced program 
of allowable-action. It classifies states caused by actions that are taught by a 
program. 

The induction algorithm by ICE starts with examples collected by random- 
collection. It induces a program of allowable-action from the examples. Then, it 
collects examples by guided-collection with the induced program. Collected ex- 
amples are joined with the previous examples and used to induce a new program. 
This process is iterated. 

We expect that there is an advantage with this learning algorithm, which is 
explained by the following two effects of iteration loops. 

1. (Effect of counter-examples) If an induced program is not perfect, it 
may generate inappropriate actions and may cause inappropriate behavior. 
Although the chatty but erroneous teacher may not detect this inappropri- 
ateness, the taciturn but reliable teacher may detect it. A robot can have an 
enormous number of situations, and so these kinds of inappropriate actions 
are observed even after a few iterations of learning. The collected negative 
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random-collection(T : a teacher; N : int) 

Put a robot in a position 

E+ ■- 0; E~ := 0 

While |_E+| < N OT \E~\ < N do 

s := (a sense data at this time) 
go := (a robot status at this time) 

Choose a possible action a at random 
Let the robot take the action a 
qi := (a robot status at this time) 

If T classifies {go, gi) a good action 

then_E''':=iJ+U{allowable-action(s, a)} 
else _E“:=_E“U{allowable-action(s, a)} 

EndWhile 

E~^ := the first N examples of E'*' 

E~ := the first N examples of E~ 

Return (E'^,E~) 

(a) 

guided-collection(T : a teacher; P : a program of allowable-action; Ai : int) 

Put a robot in a position 

E+ ■- 0; E~ := 0 

While \E+\ < N OT \E~\ < N do 

s := (a sense data at this time) 
go := (a robot status at this time) 
a := choose-action(s, P) 

Let the robot take the action a 
gi := (a robot status at this time) 

If T classifies {go, < 7 i) a good action 

then_E''':=iJ+U{allowable-action(s, a)} 
else _E“:=_E“U{allowable-action(s, a)} 

EndWhile 

E'^ := the first N examples of E+ 

E~ ;= the first N examples of E~ 

Return (E'^,E~) 



Fig. 2. Example collection procedures from (a) random trials and (b) an induced rule. 



examples work as counter-examples against the previously induced rules, 
and refine the rules. 

2. (Effect of accumulated positive examples) We can expect appropriate 
behavior from the iteration of induction. If an induced program causes more 
cases of appropriate behavior than the previous iteration, we can collect 
more examples of suitable cases. The examples are accumulated as positive 
examples and used in the induction. If we use a greedy ILP algorithm based 
on coverage, that is, if a clause (i.e. an if-then rule) that explains a large 
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part of the examples is chosen first, then clauses that yield appropriate be- 
havior can be expected to be more likely. As a result, this effect gives stable 
induction of expected clauses in programs. 

These two effects are expected to work to induce increasingly good programs. 
Both effects are related to each other. 

4 Problem — Khepera Robots 

The architecture of robots A Khepera robot has eight optical sensors and 
eight infrared sensors. A level of an optical sensor indicates the distance to a 
source of light, and a combination of levels indicates the direction of the light. 
The infrared sensors measure distances and direction to objects. A sensor re- 
turns an integer value from 0 to 1023. Only the sixteen sensors are a source of 
information that a Khepera robot can use to decide its actions. We can treat 
the sensor information as a 16-dimensional vector 

(oi, • • • 08 , A, • • • is), 

where oi, • • • , os are values of the optical sensors and ii, - ■ ■ ,is are values of the 
infrared sensors. On the other hand, actions that a Khepera can take are to 
drive right and left wheels from the —1 to 3 levels. Although a real Khepera 
has the ability to drive in the —10 to 10 level range, we have restricted it. The 
driving speed of a wheel is in proportion to the value. Negative values mean 
driving backwards. For a given speed to the two driving wheels a Khepera can 
control the speed and direction of its motion. Consequently, the motor actions 
are represented as a 2-dimensional vector 



{mi,mr), 

where mi and are the speed levels of the left and right wheels, respectively. 

The task for a robot Khepera robots provide a good test-bed for research 
into robot control. We consider the problem of letting a Khepera robot go from 
an initial position to a goal position in a maze, which was constructed as shown 
in FigureO A maze is a path from an initial position, denoted by S in Figure E] 
to a goal, denoted by G. A Khepera robot is placed at the initial position and 
allowed to move using induced rules. 

The maze or path is divided into small regions, each of which is divided by a 
straight line and does not include a bend. The lines dividing the maze are shown 
by broken lines in Figure El At the center of the line separating each region, 
there is a light that the Khepera robot can use as a guide. The light will be on 
when the Khepera robot enters the region of the light and it will be off when 
the Khepera leaves. A line, and the light on the line, act as a subgoal for the 
Khepera robot. The Khepera robot is expected to learn control rules to follow 
lights and avoid collisions. 
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Fig. 4. Khepera robot and a 
subgoal. 



choose-action(s : sensor information, P : a program of allowable-action) 
Enumerate all possible actions A — {ai, 02 , • ■ • , a-n} of robots 
Collect allowable actions A' = {a £ A \ allowable-action(s, a) = true wrt P} 
Return choose(^') 



chooser (A) 

If A ^ 0 return an action from A at random 
else return a possible action a at random 



choose2(A) 

If A ^ 0 return an action that equals to the barycenter of A 

else return an action that equals to the barycenter of all of possible actions 

Fig. 5. A procedure choose-action to choose an action. 



The induction problem For a target robot-learning problem, we consider a 
relation 

allowable-action(s, a), 

where s is information from a sensor that a robot receives, and a is an action 
that the robot is allowed to take for good control. This relation indicates a true 
or false for a given pair of s and a. A procedure to calculate an action that a 
robot takes, at the moment that information is being received from a sensor, 
is given in Figure [H There are choices for these action selection algorithms. 
Figure [5] contains two algorithms, choosei and choose2, to switch between in 
the procedure choose-action. choosei selects an action from allowable ones at 
random and choose2 calculates the barycenter; that is, the average motor vector 
of allowable actions as motor vectors. We use choosei in the induction processes, 
and we use two of them in the application processes for comparison. 

We did not take the approach of inducing programs to calculate actions from 
the inputs of sensor information directly, but tried to induce programs of the 
allowable-action relation. This is because the former approach has difficulty with 
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Fig. 6. A snapshot of a Khepera Simulator. 



the executability of induced programs. We need special treatments for induction 
algorithms and for background knowledge, as discussed in jlnuzuka, et ah, 1997| . 

5 Learning Rules of Khepera Robots 

In order to learn control rules of Khepera robots in the maze environment by 
the ICE method, we prepared an inaccurate teacher T. 

Teacher T says yes if the condition {AV B) A -iC is satisfied for an action, 
and says no otherwise. Conditions A, B, and C are defined as follows: 

Condition A : It holds D — D' > 0, 

Condition B : It holds R — R' > 0, and 
Condition C : A collision occurs by the action. 

where D and D' are the distances between the center point of Khepera and the 
light on the subgoal line before and after the action is taken, respectively. The 
angles R and R' are between Line a (the center line of Khepera), and Line b (the 
line from Khepera to the light) before and after the action is taken, respectively. 
These are illustrated in Figure H) 

Condition Ay B constructs a chatty but erroneous teacher. It is based on the 
intuitive idea that a Khepera robot should approach a subgoal and turn towards 
it. This sometimes leads to mistakes because actions obeying this criterion may 
make Khepera touch the wall, or the shortest path contains actions that are 
against this criterion. In order to give a weak condition, we give Ay B but not 
A A B. The weak condition will most likely contain appropriate conditions but 
will also contain inappropriate ones. Condition -iC gives a taciturn but reliable 
teacher, which is based on the global constraint of avoiding collisions. 

To construct the teacher and to collect examples using it, we used free soft- 
ware of a Khepera Simulator \Miche\, 1996| , developed by O. Michel. Figure E] 
shows a snapshot of the simulator. It simulates the specification of Khepera 
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Table 1. Predicates in background knowledge. 



predicates 


explanations 


most-sense-opt)^, N) 
sense-opt)^, N) 
sense-opt-near(far)(S', N) 
nosense-opt(S, N) 
most-sense-inf(S, N) 
sense-inf)^, N) 
sense-inf-near(far)(S', N) 
nosense-inf)^, N) 
go-a head (back) (M) 
move-fast(slow)(M) 
left(right)-turn(M) 
left(right)-rotate(M) 
is- forward (back)-sensor( A) 
is-right(left)-forward-sensor(A) 
is-left(right)-side-sensor(A) 
is-opposite-side-sensor(Ai, N2) 
symmetric(Ai, N2) 


Optical sensor N senses the most in S 

Optical sensor N senses something in S 

Optical sensor N senses more (less) than a level in S 

Optical sensor N does not sense in S 

Infrared sensor N senses the most in S 

Infrared sensor N senses something in S 

Infrared sensor N senses more (less) than a level in S 

Infrared sensor N does not sense in S 

M drives both wheels forward (back) 

M drives wheels fast (slow) 

M drives right wheel faster (slower) than the left 

M drives only right (left) wheel forward 

Sensor A is a forward (back) sensor 

Sensor A is a forward (back) sensor 

Sensor A is a left (right) side sensor 

Sensor Ai is at a mirrored position to sensor A2 

Sensor Ai is at a symmetric position to sensor A2 



robots precisely. We can easily combine a learning method with the simulator. 
We modified the software to implement the subgoal mechanism and to com- 
municate with Prolog processes, which are necessary for using induced logic 
programs. 

Experimental learning was conducted with the teacher and the Khepera 
Simulator. At each example collection, 100 positive and 100 negative exam- 
ples were collected; that is, N in the random-collection and guided-collection is 
set to 100. For the induction process, we used a FOIL-like ILP system FOIL-I 
[Inuzuka, et ah, 1996|lnuzuka, et ah, 1997] with modifications for using an in- 
tentional definition of background knowledge. This modified system was used in 
[Nakano, et al., 1998| . A set of background knowledge was prepared. A part of 
the background knowledge is shown in Table [U Background knowledge consists 
of three groups: knowledge of sensor information; knowledge of motor actions; 
and knowledge of sensor’s positions. They are restricted basic and straightfor- 
ward predicates. 

The detail of the experiment setting is as follows: 

1. Collects a sample set S consists of 100 positive and 100 negative examples 
by random-collection (T, 100), where T is the teacher. 

2. Induce a program P of allowable-action using an ILP algorithm from S and 
background knowledge. 

3. Observe behavior of a robot obeying the choose-action procedure and the 
program P. 
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4. Collects another 100 positive and 100 negative examples by 
guided-collection(T, P, 100), and merge them to S. (This step and the previ- 
ous step can be done simultaneously) 

5. Go to 2 and repeat. 

To compare our method with a normal setting, we have also done the experiments 
with normal collection procedure, i.e.: 

1. Collects a sample set S consists of 100 positive and 100 negative examples 
by random-collection(T, 100), where T is the teacher. 

2. Induce a program P of allowable-action using an ILP algorithm from S and 
background knowledge. 

3. Observe behavior of a robot obeying the choose-action procedure and the 
program P. 

4. Collects another 100 positive and 100 negative examples by 
random-collection (T, 100), and merge them to S. 

5. Go to 2 and repeat. 

6 Results 

Figure [7] shows a typical result of the learning process, where the 10 iterations of 
example collection are carried out. Figure IZla) shows the numbers of collisions 
that occurred during the runs of a Khepera robot using each induced program 
from the ICE method and the random collection method. The numbers are the 
average of five trials. Figure EDb) shows the number of collisions that occurred 
directly from the induced rules. Although a Khepera robot uses induced rules, 
these sometimes say that no actions are appropriate. Nevertheless, a Khepera 
robot has to choose an action and so it chooses one at random. Many collisions 




(a) The number of colli- (b) The number of colli- (c) The action steps of 
sions. sions directly caused by robots to reach a goal. 

induced Rules. 



Real lines and broken lines show the results of the proposed method and a naive 
method, respectively. The *’s represents the cases for which a Khepera robot 
cannot reach a goal. 

Fig. 7. Collisions and actions steps of runs by induced rules. 
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(a) A run with applica- 
tion of rules in the first 
iteration. 















1 



(b) A run with applica- 
tion of rules in the fourth 
iteration. 




(c) A run with applica- 
tion of rules in the tenth 
iteration. 



Fig. 8. Traces of runs of a Khepera robot using induced rules in each iteration. 
Table 2. Comparison of action selection methods. 





choosei choos 62 


collisions occurred 
steps in a run 


6.8 3.4 

240 181 



occur for this reason. FigureE^b) only plots collisions, except those that occurred 
for this reason. In both graphs, Figure ITf a) and (b), the number of collisions 
decreased as the number of examples increased. This tendency is clear in the ICE 
method. The random collection method cannot have a stable effect. Although 
collisions directly caused by induced rules also decreased in the random collection 
method, this was not true for the total number of collisions. This may be due 
to the many examples having the effect of avoiding collisions, but they do not 
work to induce rules that give actions for all cases. Figure [7[c) shows the steps 
taken by a Khepera robot in runs with induced rules. This clarifies the effect of 
rules in efficiency. This graph gives clear evidence that ICE induced better rules. 
Example traces of runs of a robot using induced rules from the first, fourth, and 
tenth iterations are shown in Figure [H 

Table [^summarizes the performance when we used the two different selection 
methods of actions. choose2, which uses actions calculated as a barycenter of 
allowable actions, had a better performance. 

Figure [9] shows results in a different condition. Here we used a teacher with 
the following conditions This teacher says yes if the condition {(A A S') V {A' A 
B)} A -iC is satisfied for an action, and says no otherwise. Conditions A, B, and 
C are the same as the ones defined in Section 5, and Conditions A' and B' are 
defined as follows: 

Condition A' : It holds D — D' > — c • MaxMove and 
Condition B' : It holds R — R' > —c ■ MaxAngle, 
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Case of c = 0.1. 




(a) The number of colli- (b) The number of colli- (c) The action steps of 

sions. sions directly caused by robots to reach a goal. 

induced Rules. 



Case of c = 0.01. 




examples examples examples 



(d) The number of colli- (e) The number of colli- (f) The action steps of 
sions. sions directly caused by robots to reach a goal. 

induced Rules. 

Real lines and broken lines show the results of the proposed method and a naive 
method, respectively. The *’s represents the cases for which a Khepera robot cannot 
reach a goal. 

Fig. 9. Results from examples with tighter teachers. 



where MaxMove is the maximun distance that the robot can move in a step, 
MaxAngle the maximum angle that the robot can turn in a step, and c is a 
constant in 0 < c. In the case of c = oo both Conditions A' and B' are always 
held and the teacher is the same as the teacher T in Section 5. The smaller c a 
teacher defined, the tigher condition it has. Results of conditions with c = 0.1 
and c = 0.01 is shown in Figure |9] We can observe worse behavior in the case of 
c = 0.1 and more worse in c = 0.01 than the weak teacher (Figure [7|), although 
the ICE method works much better than the rondom collection procedure. 
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7 Concluding Remarks 



This paper proposed an iterative learning method, ICE, in the ILP framework us- 
ing the iterative collection of examples method. Examples are collected that are 
guided by previously induced programs. Negative examples collected using the 
previous rules are near-misses, which are important negative examples for refin- 
ing the rules, as emphasized in [Dietterich and Michalski, 1983| Winston, 1975| . 
Moreover, the positive examples also worked to collect appropriate positive ex- 
amples, because actions lead to many new situations that are guided by the 
previously induced rule. As we discussed in Section |3] iteration of induction and 
collection of examples refine the results effectively. This was evidenced by the 
experiments using a Khepera robot. 

Iterative learning in the field of robot control is a major approach and has 
been investigated with many ML methods, such as neural networks, genetic 
algorithms, and reinforcement learning [Franklin, et ah, 1996| . The ILP frame- 
work, however, has not used iterative learning because a special treatment is 
needed that refines the induced rules. The proposed method uses iteration, for 
example collection, and does not need any special mechanism for ILP meth- 
ods. Direct application of concept learning to robot control is a difficult task. 
[Kingspor, et ah, 1996 tried to learn concepts in robot learning using a hierarchi- 
cal structure of concepts. Another work related to the control problem includes 
[Dzeroski, et ah, 1995| , which tried to induce control rules in a straightforward 
way. This paper presents the advantages in inducing rules in a difficult situation 
for robot learning. 

Another important contribution of this paper is the presentation of a mech- 
anism for incremental refinement using an ILP framework. This can be applied 
only for cases that hold the following properties: 



1. An inaccurate teacher can be provided with the property explained in Sec- 
tion O and 

2. Various instances of a target relation can be generated by using induced 
rules. 



For inaccurate teachers, a taciturn but reliable teacher plays important roles for 
generating counter-examples, in particular. Planning application may have these 
properties. If we can provide a method for generating instances with induced 
rules, we may apply the ICE procedure to wider fields. 
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Abstract. In on-line reinforcement learning, often a large number of 
estimation parameters (e.g. Q-value estimates for 1-step Q-learning) are 
maintained and dynamically updated as information comes to hand dur- 
ing the learning process. Excessive variance of these estimators can be 
problematic, resulting in uneven or unstable learning, or even making ef- 
fective learning impossible. Estimator variance is usually managed only 
indirectly, by selecting global learning algorithm parameters (e.g. A for 
TD{\) based methods) that are a compromise between an acceptable 
level of estimator perturbation and other desirable system attributes, 
such as reduced estimator bias. In this paper, we argue that this approach 
may not always be adequate, particularly for noisy and non-Markovian 
domains, and present a direct approach to managing estimator variance, 
the ccBeta algorithm. Empirical results in an autonomous robotics do- 
main are also presented showing improved performance using the new 
ccBeta method. 



1 Introduction 

Many domains of interest in robot learning (and in AI more generally) are too 
large to be searched exhaustively in reasonable time. One approach has been to 
employ on-line search techniques, such as reinforcement learning (RL) | 14| . 

At first blush, RL presents an excellent conceptual fit for many autonomous 
robotics applications. However, physically situated learning agents face a num- 
ber of specific challenges not directly addressed by the Markov decison pro- 
cess/dynamic programming theoretical framework that is conventionally used 
to analyse RL. In general, situated learning domains will appear to be noisy, 
non-stationary, and the state of the world will not be fully observable. The 
learning agent will possibly have to contend with other agents, which may be 
sympathetic with, benignly indifferent to, or agressively antagonistic towards 
the agent’s goals. There will be real-time constraints that have to be addressed. 

All of the above challenges the simplifying assumptions that provide the usual 
preamble to discussing methods for solving Markov decision processes. In this 
paper we attempt to address at both a theoretical and practical level a particular 
problem for which several of the considerations we have listed above have direct 
import, namely that of managing estimator variance. 



J. Wyatt and J. Demiris (Eds.): EWLR 1999, LNAI 1812, pp. 84 41021 2000. 
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In on-line reinforcement learning, typically a large number of estimation 
parameters (e.g. Q-value estimates for 1-step Q-learning) are maintained and 
dynamically updated as information comes to hand during the learning pro- 
cess. Excessive variance of these estimators during the learning process can be 
problematic, resulting in uneven or unstable learning, or even making effective 
learning impossible. 

Normally, estimator variance is managed only indirectly, by selecting global 
learning algorithm parameters (e.g. A for TD{X) based methods) that trade-off 
the level of estimator perturbation against other system attributes, such as esti- 
mator bias or rate of adaptation. In this paper, we give reasons why this approach 
may sometimes run into problems, particularly for noisy and non-Markovian do- 
mains, and present a direct approach to managing estimator variance, the ccBeta 
algorithm. 

1.1 RL as On-Line Dynamic Programming 

If an RL method like 1-step Q-learning (QL) |ltij is used to find the optimal policy 
for a Markov decision process (MDP), the method resembles an asynchronous, 
on-line form of the dynamic programming value iteration method. QL can be 
viewed as a relaxation method that successively approximates the so-called Q- 
values of the process, the value Q'^{st,at) being the expected value of the return 
by taking an action a* from state St and following a policy tt from that point on. 

We note that if an RL agent has access to Q-values for an optimal policy for 
the system it is controlling, it is easy to act optimally without planning; simply 
selecting the action from each state with the highest Q-value will suffice. The QL 
algorithm has been shown to converge, under suitable conditions, to just these 
Q-values. The algorithm is briefly recounted below. 

At each step, QL updates an entry in its table of Q-value estimates according 
to the following rule (presented here in a “delta rule” form): 



The 1-step CTR is a special case of the n-step CTR. Using Watkins’ (1989) 



Q{sti fli) Q{st, o-t) + /3A( 



( 1 ) 



where is a step-size (or learning rate)0 parameter, and 



Lit = - Q(st,«t) 

where is the 1-step corrected truncated return (CTR): 

= U + 7maxQ(st-K,a) 

a 



(2) 



( 3 ) 



notation 




^ The terms step-size and learning rate are both used throughout this paper, and are 



essentially interchangeable. 
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where r|"^ is the simple uncorrected n-step truncated return (UTR) 

n— 1 

As u — >■ oo, both and approach the infinite horizon or actual return 

OO 

r* = X ^*’’*+* 

i=0 

as a limiting case. 

We note that if in were replaced with the actual return, then this 

would form the update rule for a Monte Carlo estimation procedure. However, 
rather than waiting for the completed actual return, QL instead employs a “vir- 
tual return” that is an estimate of the actual return. This makes the estimation 
process resemble an on-line value iteration method. One view of the n-step CTR 
is that it bridges at its two extremes value iteration and Monte Carlo methods. 
One could also make the observation that such a Monte Carlo method would be 
an on-line, asynchronous analog of policy iteration, another important dynamic 
programming technique. 

The central conceptual importance of the CTR to RL techniques is that 
virtually all RL algorithms estimate the value function of the state/action pairs 
of the system using either single or multi-step CTRs directly, as in the case of 
QL or the C- Trace algorithm [S], or as returns that are equivalent to weighted 
sums of varying length n-step CTRs, such as the TD{\) return Pile] - 

2 CTR Bias and Variance 

For RL in Markovian domains, the choice of length of CTR is usually viewed 
as a trade-off between bias and variance of the sample returns to the estimation 
parameters, and hence of the estimation parameters themselves (e.g. Watkins 
1989). 

The idea is that shorter CTRs should exhibit less variance but more bias 
than longer CTRs. The increased bias will be due to the increased weight in the 
return values of estimators that will, in general, be inaccurate while learning is 
still taking place. The expected reduction in estimator variance is due to the 
fact that for a UTR the variance of the return will be strictly non-decreasing as 
n increases. 

Applying this reasoning uncritically to CTRs is problematic, however. In the 
case of CTRs, we note that initial estimator inaccuracies that are responsible for 
return bias may also result in high return variance in the early stages of learning. 
Thus, in the early stages of learning, shorter CTRs may actually result in the 
worst of both worlds - high bias and high variance. 

By way of illustration, consider the simple MDP depicted in Figure [U The 
expected variance of the sampled returns for action 0 from state A will be arbi- 
trarily high depending upon the difference between the initial estimator values 
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Fig. 1. A 4-state/l action MDP. State A is the starting state, states B and C are 
equiprobable successor states after taking action 0. Actions from B and C immediately 
lead to termination. The immediate reward at each step is zero. If 1-step CTRs are 
used, the variance as well as bias of the estimator returns for State A/action 0 depends 
on the difference of the initial estimator values for states B and C. On the other hand, 
if CTRs of length 2 or greater are used, the estimator returns will be unbiased and 
have zero variance. 



for actions from states B and C. In this case, the estimator for (7l,0) would 
experience both high bias and high variance if 1-step CTRs were to be used. 
On the other hand, using CTRs of length 2 or greater would result in unbiased 
estimator returns with zero variance at all stages of learning for this MDP. 

In general, as estimators globally converge to their correct values, the variance 
of an n-step CTR for an MDP will become dominated by the variance in the 
terms comprising the UTR component of the return value, and so the relation 

* < j => uar[r|*^] < uar[rp^] (7) 

will be true in the limit. However, the point we wish to make here is that a 
bias/ variance tradeoff involving n for CTRs is not as clear cut as may be often 
assumed, particularly in the early stages of learning, or at any stage of learning 
if the domain is noisy0 even if the domain is Markovian. 



2.1 CTR Bias and Variance in NMDPs 

Perhaps more importantly, if the domain is not Markovian, then the relation 
expressed in (H) is not guaranteed to hold for any stage of the learning. To 
demonstrate this possibly surprising fact, we consider the simple 3-state non- 
Markov Decision Process (NMDP) depicted in Figure E] 

^ The argument here is that for noisy domains, estimator bias is continually being 
reintroduced, taking the process “backwards” towards the conditions of early learn- 
ing. 
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Fig. 2. A 3-state NMDP, with two available actions from starting state A, and one 
available from the successor state B. The action from state B immediately leads to 
termination and a reward; the decision process is non-Markov because the reward 
depends on what action was previously selected from state A. 



For this NMDP, the expected immediate reward from taking the action 0 
in state B depends upon which action was taken from state A. Suppose the 
(deterministic) reward function p is as follows: 

p(A,0)=0 

p(A,l)=0 

p{B, 0) = 0 (if the action from state A was 0) 

p{B, 0) = 1 (if the action from state A was 1) 

If Monte Carlo returns are used (or, equivalently in this case, n-step CTRs 
where n > 1), the estimator returns for state/action pairs {A, 0) and (A, 1) will 
exhibit zero variance at all stages of learning, and the corresponding estimators 
should rapidly converge to their correct values of 0 and 1 respectively. 

On the other hand, if 1-step CTRs are used, the variance of the {A, 0) and 
{A, 1) estimators will be non-zero while the variance of the estimator for {B, 0) is 
non-zero. The estimator for {B, 0) will exhibit non-zero variance as long as both 
actions continue to be tried from state A, which would normally be for all stages 
of learning for the sake of active exploration. Finally, note that the variance for 

(i?, 0) will the same in this case for all n-step CTRs n > 1. Hence, the overall 

estimator variance for this NMDP is strictly greater at all stages of learning for 
1-step CTRs than for any n-step CTRs n > 1. 

In previously published work studying RL in noisy and non-Markovian do- 
mains [8], excessive estimator variance appeared to be causing problems for 
1-step QL in domains where using Monte Carlo style returns improved matters. 
These unexpected experimental results did not (and still do not) fit well with the 
“folk wisdom” concerning estimator bias and variance in RL. We present these 
analyses firstly as a tentative partial explanation for the unexpected results in 
these experiments. 

Secondly, the foregoing analysis is intended to provide some theoretical moti- 
vation for an entirely different approach to managing estimator variance in RL, 
in which attention is shifted away from CTR length and is instead focused on the 
step-size parameter /3. In the next part of this paper, we discuss a new algorithm 
(perhaps more accurately a family of algorithms) we call ccBeta, which results 
from taking such an approach. 
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3 /3: Variance versus Adaptability 

In RL, finding a good value for the step-size parameter j3 for a particular al- 
gorithm for a particular domain is usually in a trial-and-error process for the 
experimenter, which can be time consuming. The resulting choice is usually a 
trade-off between fast adaptation (large /3) and low estimator variance (small /3) . 
In RL, and in adaptive parameter estimation systems generally, there emerges a 
natural tension between the issues of convergence and adaptability. 

Stochastic convergence theory m suggests that a reducing j3 series (such as 
i5i = \ji) that satisfies the Robbins and Monro criteria 

oo oo 

^/3i = oo, and ^/3^<oo (8) 

i=l i=\ 

may be used to adjust an estimator’s value for successive returns; this will guar- 
antee in-limit convergence under suitable conditions. However, this is in general 
not a suitable strategy for use in non-stationary environments i.e. environments 
in which the schedule of payoffs may vary over time. While convergence proper- 
ties for stationary environments are good using this technique, re-adaptation to 
changing environmental conditions can be far too slow. 

In practice, a constant value for the (3 series is usually chosen. This has 
the advantage of being constantly sensitive to environment changes, but has 
poorer convergence properties, particularly in noisy or stochastic environments. 
It appears that the relatively poor convergence properties of a constant /3 series 
can lead to instabilities in learning in some situations, making an effective trade- 
off between learning rate and variance difficult. 

A method for automatically varying the step-size (3 parameter by a simple 
on-line statistical analysis of the estimate error is presented here. The resulting j3 
series will be neither constant nor strictly decreasing, but will vary as conditions 
indicate. 



3.1 The ccBeta Algorithm 

At each update step for the parameter estimate, we assume we are using a 
“delta-rule” or on-line LMS style update rule along the lines of 

^ Qi—1 

Qi ^ Qi-1 + f3i^\i 



where Zi is the returned value in the series we are trying to estimate and Pi 
is the value of the step-size schedule series used. Qi is the estimate of this 
series, and Ai is the value of the error series. 

The idea behind ccBeta is quite straightforward. If the series of estimate 
errors for a parameter is positively auto-correlated, this indicates a persistent 
over- or under-estimation of the underlying value to be estimated is occurring, 
and suggests P should be increased to facilitate rapid adaptation. If, on the other 
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hand, the estimate errors are serially uncorrelated, then this may be taken as 
an indication that there is no systematic error occurring, and (3 can be safely 
decreased to minimise variance while these conditions exist. 

So, for each parameter we are trying to estimate, we keep a separate set of 
autocorrelation statistics for its error series as follows, where cci is derived as an 
exponentially weighted autocorrelation coefficient: 

sumsquare-erri ^ K.sumsquare-erri-i + Af (9) 

sumjproducti ■<— K.sumjproducti-i + Ai.Ai-i (10) 

sumjproducti , . . , 

ccj •<— — ■ - (11) 

^ sum^square-erri . sumsquare-erri-i 

At the start of learning, the sumsquare-err and suni-product variables are 
initialised to zero; but this potentially leads to a divide-by-zero problem on the 
RHS of (fTT)l . We explicitly check for this situation, and when detected, cci is set 
to lH 

We note that if in (0 and m the exponential decay parameter K G 
[0, 1] is less than 1, two desirable properties emerge: firstly, the values of the 
sumsquare-err and sumjproduct series are finitely bounded, and secondly the 
correlation coefficients are biased with respect to recency. While the first prop- 
erty is convenient for practical implementation considerations with regard to 
possible floating point representation overflow conditions etc., the second prop- 
erty is essential for effective adaptive behaviour in non-stationary environments. 
Setting AT to a value of 0.9 has been found to be effective in all domains tested so 
far; experimentally this does not seem to be a particularly sensitive parameter. 

It is also possible to derive an autocorrelation coefficient not of the error 
series directly, but instead of the sign of the error series, i.e. replacing the Ai 
and Ai-i terms in (|9| and m with sgn{Ai) and sgn{Ai-i). This variant may 
prove to be generally more robust in very noisy environments. 

In such a situation an error series may be so noisy that, even if the error 
signs are consistent, a good linear regression is not possible, and so (3 will be 
small even when there is evidence of persistent over- or under-estimation. This 
approach proved to be successful when applied to the extremely noisy real robot 
domain described in the next section. Based on our results to date, this version 
could be recommended as a good “general purpose” version of ccBeta. 

Once an autocorrelation coefficient is derived, (3i is set as follows: 

if {cCi > 0) 

A ^ cci * MAX. BET A 
else /3i ^ 0 

if iP, < MIN .BETA) 

P, G- MIN. BET A 

® This may seem arbitrary, but the reasoning is simply that if you have exactly one 
sample from a population to work with, the best estimate you can make for the 
mean of that population is the value of that sample. 
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First, we note that in the above pseudo-code, negative and zero auto-correlat- 
ions are treated the same for the purposes of weighting A strongly negative 
auto-correlation indicates alternating error signs, suggesting fluctuations around 
a mean value. Variance minimisation is also desirable in this situation, motivating 
a small (3i. 

On the other hand, a strongly positive cci results from a series of estima- 
tion errors of the same sign, indicating a persistent over- or under-estimation, 
suggesting a large j3i is appropriate to rapidly adapt to changed conditions. 

Setting the scaling parameters MIN_BETA to 0.01 and MAXJ3ETA to 1.0 
has been found to be effective, and these values are used in the experiments 
that follow. Although values of 0 and 1 respectively might be more “natural”, 
as this corresponds to the automatic scaling of the correlation coefficient, in 
practice a small non-zero MIN_BETA value was observed to have the effect of 
making an estimate series less discontinuous (although whether this offers any 
real advantages has not been fully determined at this stage.) 

Finally, we note that prima facie it would be reasonable to use ccf rather than 
cCi to weight j3t- Arguably, cc? is the more natural choice, since from statistical 
theory it is the square of the correlation coefficient that indicates the proportion 
of the variance that can be attributed to the change in a correlated variable’s 
value. 

Both the cCi and ccf variations have been tried, and while in simulations 
marginally better results both in terms of total square error (TSE) and variance 
were obtained by using ccf , a corresponding practical advantage was not evident 
when applied to the robot experiments. 



4 Related Work 

Darken, Chang and Moody [4j described a heuristic “search then converge” re- 
ducing learning rate schedule designed for use with backpropagation networks. 
The essential idea is that the learning rate remains almost constant at the be- 
ginning of the series (“search mode”) but after a certain point (determined by 
parameter choice) starts reducing roughly in inverse proportion to time step t 
(“converge mode”)0 Thus, we would expect the adaptation properties of such 
a schedule would resemble that of a fixed rate schedule while in “search mode” 
(fast adaptation but high estimator variance), and a reducing schedule when in 
“converge mode” (poor adaptation but low estimator variance). 

According to Jacobs |5], Kesten first suggested that an alternating er- 
ror sign could be used as an indication that a step-size parameter should be 
reduced in the context of a stochastic approximation method, and that Saridis 
HD! extended the idea to both increasing and decreasing the learning rate (i.e. 
if successive errors are of the same sign, the step-size should be incremented). 

Bradtke and Barto |2| applied Darken et al.’s learning rate scheduler to an on- 
line temporal difference reinforcement learning algorithm; they conjectured that the 
way they did this would indirectly satisfy the Robbins and Monro criteria (refer to 
Equation 11 for convergence of their estimators. 
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Jacobs himself proposes the Delta-Bar-Delta learning rate update rule for 
use in the context of multi-layer nonlinear backpropagation networks. This is a 
refinement of earlier work by Barto & Sutton [T], and Sutton |T^. Jacobs con- 
trasts the properties of his proposed update rule with some of the shortcomings 
of previously proposed methods. Jacobs also proposes four general heuristics 
applicable to a step-size adjustment method: 

— Every system parameter should have its own learning rate 

— Every learning rate should be allowed to vary over time 

— When the derivative of a parameter possesses the same sign for several con- 
secutive time steps, the learning rate for that parameter should be increased 

— When the sign of the derivative of a parameter alternates for several consec- 
utive time steps, the learning rate for that parameter should be decreased 

We note that the ccBeta method proposed in this paper satisfies these principles, 
even though Jacobs was primarily concerned with improving rates of convergence 
over steepest descent techniques in the context of backpropagation networks. 
Further, we might claim that ccBeta uses a more statistically principled approach 
in adjusting the step-size by directly calculating the 1-step autocorrelation of the 
error signal time series; the heuristics proposed for Delta-Bar-Delta method are 
plausible but without a clear theoretical underpinning. Jacobs himself claims no 
special status for the implementation of these heuristics he proposes, concluding 
his paper with the comment “Clearly there are other implementations of the 
heuristics worth studying.” 

Sutton m later proposed an incremental version of the Delta-Bar-Delta rule 
(IDBD) for use in a linear function approximator. Like Jacobs and Darken et 
ah, he analyses his method in terms of gradient descent, which is natural in 
the context of stochastic function approximation when all system weights are 
generally updated in parallel. In the context of on-line RL, as we discuss in this 
paper, this analysis does not apply directly, however. To point to one obvious 
difference, typically only a subset of state or state/action pair value estimators 
are updated upon each time step. 

Nonetheless, all of the basic intuitions that motivate the learning rate adap- 
tation techniques for stochastic approximation methods apply equally well to the 
situation where we are interested in updating Q-value or state value estimates in 
an on-line RL setting. In the experiments that are described in the next section, 
we test these intuitions directly. 

5 Experimental Results 

In all the experiments desribed in this section we use the same variant of ccBeta; 
this has a K parameter of 0.9, and uses sgn(A) normalisation to calculate cci. For 
weighting /3j, we use cCi rather than ccf , and have MIN_BETA and MAX J3 ETA 
set to 0.01 and 1.0 respectively. 
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5.1 Simulation Experiment 1 

In this section, we describe some simulation studies comparing the variance and 
re-adaptation sensitivity characteristics of the ccBeta method for generating a 
(3 step-size series against standard regimes of fixed (3 and reducing (3, where 
f3i = Iji. 

The learning system for these experiments was a single-unit on-line LMS 
estimator which was set up to track an input signal for 10,000 time steps. In 
the first experiment, the signal was stochastic but with stationary mean: a zero 
function perturbed by uniform random noise in the range [-0.25,-1-0.25]. The 
purpose of this experiment was to assess asymptotic convergence properties, in 
particular estimator error and variance. 

As can be seen from Table [H the reducing beta schedule of j3i = 1/i was 
superior to fixed beta and ccBeta in terms of both total square error (TSE) and 
estimator variance for this experiment. As we would expect, variance (normalised 
to standard deviation in the tabled results) increased directly with the magnitude 
of beta for the fixed beta series. In this experiment ccBeta performed at a level 
between fixed beta set at 0.1 and 0.2. 

Table 1. Results for simulation experiment 1. 



Simulation experiment 1 (noisy zero fn) 
T.S.E. Std. Dev. 


ccBeta 


17.95 


0.042 


beta = 0.1 


10.42 


0.032 


beta = 0.2 


22.30 


0.047 


beta = 0.3 


35.72 


0.060 


beta = 0.4 


50.85 


0.071 


reducing beta 


0.17 


0.004 



5.2 Simulation Experiment 2 

In the second experiment, a non-stationary stochastic signal was used to assess 
re-adaptation performance. The signal was identical to that used in the first 
experiment, except that after the first 400 time steps the mean changed from 0 
to 1.0, resulting in a noisy step function. The adaptation response for the various 
beta series over 50 time steps around the time of the change in mean are plotted 
in Figures Eland m 

To get an index of responsiveness to the changed mean using the different 
beta series, we have measured the number of time steps from the time the mean 
level was changed to the time the estimator values first crossed the new mean 
level, i.e. when the estimator first reaches a value > 1. 

As can be seen from Table E] the reducing beta schedule of /3i = 1 ji was far 
worse than for either fixed beta or ccBeta in terms of TSE and re-adaptation 
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performance (“Steps to Crossing”, column 2). Indeed, the extreme sluggishness 
of the reducing beta series was such that the estimator level had risen to only 
about 0.95 after a further 9,560 time steps past the end of the time step window 
shown in Figure The relatively very high TSE for the reducing beta series 
was also almost entirely due to this very long re-adaptation time. The inherent 
unsuitability of such a regime for learning in a non-stationary environment is 
clearly illustrated in this experiment. Despite having “nice” theoretical proper- 
ties, it represents an impractical extreme in the choice between good in-limit 
convergence properties and adaptability. 



Table 2. Results for simulation experiment 2. 



Simulation experiment 2 (noisy step fn) 





T.S.E. 


Steps to Crossing 


ccBeta 


21.60 


10 


beta =0.1 


16.19 


35 


beta = 0.2 


25.84 


15 


beta = 0.3 


38.56 


9 


beta = 0.4 


53.35 


8 


reducing beta 395.73 


>9,600 



In the Figure[3]plots, the trade-off of responsiveness versus estimator variance 
for a fixed beta series is clearly visible. We note however that the re-adaptation 
response curve for ccBeta (Figure [4|) resembles that of the higher values of fixed 
beta, while its TSE (Table [2]) corresponds to lower values, which gives some 
indication the algorithm is working as intended. 

5.3 Experiment 3: Learning to Walk 

For the next set of experiments we have scaled up to a real robot-learning prob- 
lem: the gait coordination of a six- legged insectoid walking robot. 

The robot (affectionately referred to as “Stumpy” in the UNSW AI lab) is 
faced with the problem of learning to walk with a forward motion, minimising 
backward and lateral movements. In this domain, ccBeta was compared to using 
hand tuned fixed beta step-size constants for two different versions of the C- Trace 
RL algorithm. 

Stumpy is, by robotics standards, built to an inexpensive design. Each leg 
has two degrees of freedom, powered by two hobbyist servo motors. A 68HC11 
Miniboard converts instructions from a 486 PC sent via a RS-232-C serial port 
connection into the pulses that fire the servo motors. The primary motion sensor 
is a cradle-mounted PC mouse dragged along behind the robot. This provides 
for very noisy sensory input, as might be appreciated. The quality of the signal 
has been found to vary quite markedly with the surface the robot is traversing. 

As well as being noisy, this domain was non-Markovian by virtue of the 
compact but coarse discretized state-space representation. This compact repre- 
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Fig. 5. The robot learning to walk. 



sentatiorlfl meant learning was fast, but favoured an RL algorithm that did not 
rely heavily on the Markov assumption; in earlier work C-Trace had been 
shown to be well-suited for this domain. 

The robot was given a set of primitive “reflexes” in the spirit of Brooks’ 
“subsumption” architecture [^. A leg that is triggered will incrementally move 
to lift up if on the ground and forward if already lifted. A leg that does not 
receive activation will tend to drop down if lifted and backwards if already on 
the ground. In this way a basic stepping motion was encoded in the robots 
“reflexes” . 

The legs were grouped to move in two groups of three to form two tripods 
(Figure El • The learning problem was to discover an efficient walking policy by 
triggering or not triggering each of the tripod groups from each state. Thus the 
action set to choose from in each discretized state consisted of four possibilities: 

— Trigger both groups of legs 

— Trigger group A only 

— Trigger group B only 

— Do not trigger either group 

The robot received positive reinforcement for forward motion as detected by 
the PC mouse, and negative reinforcement for backward and lateral movements. 

Although quite a restricted learning problem, interesting non-trivial behav- 
iours and strategies have been seen to emerge. 

® 1024 “boxes” in 6 dimensions: alpha and beta motor positions for each leg group 
(4 continuous variables each discretized into 4 ranges), plus 2 Boolean variables 
indicating the triggered or untriggered state of each leg group at the last control 
action. 
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Fig. 6. Grouping the legs into two tripods. 



5.4 The RL Algorithms 

As mentioned earlier, C-Trace is an RL algorithm that uses multi-step CTRs to 
estimate the state/action value function. While one C-Trace variant, every-visit 
C-Trace (Figure [3, has been described in earlier work in application to this 
domain |^, the other, first-visit C-Trace (Figure |7|), is a variant that has not 
been previously described. 

As mentioned earlier, C-Trace is an RL algorithm that uses n-step CTRs to 
estimate the state/action Q-value function. The length n of CTRs vary, a return 
truncation/correction occurring on the time steps when a non-policy action is 
selected for the sake of active exploration. The idea is to maximise the average 
length of n to minimise bias and to speed learning, while retaining the QL prop- 
erty of “experimentation insensitivity”, which means that the effects of active 
exploration do not affect the value function being learnt (at least not for Markov 
domains) jS]. 

It is easiest to understand the difference between an every-visit and first- 
visit C-Trace in terms of the difference between an every-visit and a first-visit 
Monte Carlo algorithm. Briefly, a first-visit Monte Carlo algorithm will selec- 
tively ignore some returns for the purposes of learning in order to get a more 
truly independent sample set. In E], first-visit versus every-visit returns are dis- 
cussed in conjunction with a new “first- visit” version of the TD(X) algorithm, 
and significant improvements in learning performance are reported for a variety 
of domains using the new algorithm. 

For these reasons, a first- visit version of C-Trace seemed likely to be particu- 
larly well-suited for a ccBeta implementation, as the “purer” sampling method- 
ology for the returns should (in theory) enhance the sensitivity of the on-line 
statistical tests. 

5.5 Discussion of Results 

The average forward walking speed over the first hour of learning for two versions 
of C-Trace using both ccBeta and fixed step-size parameters are presented in the 
plots in Figures HD and ITT71 
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for all 


s G S, a G A do 




endfor 


VisitCountsa 0 


( initialise traces ) 


while not terminal state do 
get current state s 






select action a 

if non-policy action selected then 


( stochastic action selection ) 




for all i ^ S, j G A such that VisitCountij 
k ■<— GlobalClock — StepCountij 
c ■<— max^ Qim 


> 0 do 




Pij <- Pij +j’’c 
n ■<— VisitCountij 

Sumij <— (1 — f3)Sumij xijpij 


( truncated return correction ) 




Qij <— (1 — -1- Surriij 


{ apply Ci'H update J 




VisitCountij ■<— 0 

endfor 

endif 


( zero traces ) 




if VisitCountsa — 0 then 
Suirisa ■<— 0 

Xsa 0 


( first visit ) 




else 

SurUsa (1 — (3)Sumsa + XsaPsa 


( subsequent visits ) 




k ■<— ClobalClock — StepCountsa 

Xaa <- (1 - 0)Xaal'‘ + 0 

endif 

Psa ^ 0 

VisitCountsa VisitCountsa + 1 


( steps since last visit ) 




StepCountsa ■<— ClobalClock 
take action a 

if reinforcement signal r received then 


( “time-stamp” visit ) 




for all (i,j) such that VisitCountij > 0 do 
k ■<— ClobalClock — StepCountij 
Pij <- Pij + 

endfor 

endif 

ClobalClock •<— ClobalClock 1 


endwhile 




for all 


s ^ Sy a ^ A such that VisitCountsa > 0 do 
n ■<— VisitCountsa 
Sunisa (1 — (3)Sunisa + XsaPsa 


( terminal state reached ) 


endfor 


Qsa <— (1 — 0)''Qaa “t" Suvriaa 


( terminal update rule ) 



Fig. 7. Pseudo-code for every-visit C-Trace. S and A denote the state set and the 
action set respectively. The variables Q, Sum, x, p, StepCount and VisitCount are 
kept separately for each state/action pair. The subscripts identify to which state/action 
pair the variable belongs; e.g. Qij is the Q-value estimate for state i and action j. 



In the case of every-visit C-Trace (Figure E|, we notice immediately that 
the learning performance is much more stable using ccBeta than with the fixed 
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for all s G S, a G A do 




VisitCountsa 0 


( initialise traces ) 


endfor 




while not terminal state do 




get current state s 




select action a 


( stochastic action selection ) 


if non-policy action selected then 




for all i ^ S, j G A such that VisitCountij 


> 0 do 


k ■<— GlobalClock — StepCountij 




c ■<— max^ Qim 




Pij <- Pij +j’’c 


( truncated return correction ) 


Qij (1 “ 0)Qij + 0Pij 


( apply CTR update ) 


VisitCountij i— 0 


( zero traces ) 


endfor 




endif 




if VisitCountsa — 0 then 


( first visit ) 


Psa 0 




VisitCountsa ■<— 1 




StepCountsa ClobalClock 


( “time-stamp” visit ) 


endif 




take action a 




if reinforcement signal r received then 




for all i G S, j G A such that VisitCountij 


> 0 do 


k ■<— ClobalClock — StepCountij 




Pij <- Pij -1- 




endfor 




endif 




ClobalClock ■<— ClobalClock 1 




endwhile 




for all s G S, a G A such that VisitCountsa > 0 do 


( terminal state reached ) 


Qsa (1 — &)Qsa + 0Psa 


( terminal update rule ) 


endfor 





Fig. 8. Pseudo-code for first- visit C-Trace. S and A denote the state set and the action 
set respectively. The variables Q, p, StepCount and VisitCount are kept separately for 
each state/action pair. The subscripts identify to which state/action pair the variable 
belongs; e.g. Qij is the Q- value estimate for state i and action j. 



beta series. This shows up as obviously reduced variance in the average forward 
speed; significantly, the overall learning rate doesn’t seem to have been adversely 
affected, which is encouraging. It would not be unreasonable to expect some 
trade-off between learning stability and raw learning rate, but such a trade-off 
is not apparent in these results. 

Interestingly, the effects of estimator variance seem to manifest themselves 
in a subtly different way in the first- visit C-Trace experiments (Figure [TOJ. We 
notice that first-visit C-Trace even without ccBeta seems to have had a marked 
effect on reducing the step-to-step variance in performance as seen in every-visit 
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Fig. 9. The robot experiment using RL algorithm 1 (every- visit C-Trace). 




Fig. 10. The robot experiment using RL algorithm 2 (first- visit C-Trace). 
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C-Trace. This is very interesting in itself, and calls for further theoretical and 
experimental investigation 0 

However, we also notice that at around time step 8000 in the first- visit fixed- 
beta plot there is the start of a sharp decline in performance, where the learning 
seems to have become suddenly unstable. 

These results were averaged over multiple runs (n = 5) for each plot, so this 
appears to be a real effect. If so, it is conspicuous by its absence in the first- visit 
ccBeta plot: ccBeta would appear to have effectively rectified the problem. 

Overall, the combination of first-visit C-Trace and ccBeta seems to be the 
winning combination for these experiments, which is encouragingly in agreement 
with prediction. 

6 Conclusions 

Excessive estimator variance during on-line learning can be a problem, resulting 
in learning instabilities of the sort we have seen in the experiments described 
here. 

In RL, estimator variance is traditionally dealt with only indirectly via the 
general process of tuning various learning parameters, which can be a time con- 
suming trial-and-error process. Additionally, theoretical analysis presented here 
indicate that some of the pervasive ideas regarding the trade-offs involved in 
the tuning process need to be critically examined. In particular, the relationship 
between CTR length and estimator variance needs reassessment, particularly in 
the case of learning in a non-Markov domain. 

The ccBeta algorithm has been presented as a practical example of an al- 
ternative approach to managing estimator variance in RL. This algorithm has 
been designed to actively minimise estimator variance while avoiding the degra- 
dation in re-adaptation response times characteristic of passive methods. ccBeta 
has been shown to perform well both in simulation and in real-world learning 
domains. 

A possible advantage of this algorithm is that since it is not particularly 
closely tied in its design or assumptions to RL algorithms, Markov or otherwise, 
it may turn out be usable with a fairly broad class of on-line search methods. 
Basically, any method that uses some form of “delta rule” for parameter updates 
might potentially benefit from using a ccBeta-style approach to managing on-line 
estimator variance. 

® At this point, we will make the following brief observations on this effect: a) The 
reduction in variance theoretically makes sense inasmuch as the variance of the sum 
of several random variables is equal to the sum of the variances of the variables if 
the variables are not correlated, but will be greater than this if they are positively 
correlated. With every-visit returns, the positive correlation between returns is what 
prevents them being statistically independent, b) This raises the interesting possi- 
bility that the observed improved performance of “replacing traces” owes as much if 
not more to a reduction in estimator variance than to reduced estimator bias, which 
is the explanation proposed by m 
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Abstract. We present here a neural model for mobile robot action se- 
lection and trajectories planning. It is based on the elaboration of a 
“cognitive map” . This cognitive map builds up a graph linking together 
reachable places. We first demonstrate that this map may be used for 
the control of the robot speed assuring a convergence to the goal. We 
show afterwards that this model enables to select between different goals 
in a static environment and finally in a changing environment. 



1 Introduction 

One of our research interests is to propose architectures for controlling mobile 
robots. These architectures are mainly constrained by the facts that we 
want the representations to be “grounded” on the sensor data (construc- 
tivist approach) and that we want an analogical computation at each stage. 
These requirements particularly fit in the neural network (N.N.) framework. 
Furthermore, the N.N. approach enables to take inspiration from neurobi- 
ological and psychological results, as well for the architectures as for the 
learning processes (see |Baloch and Waxman, 199T| , [Millan and Torras, 1992| , 
[Verschure and Pfeifer, 1992| , IGaussier and Zrehen, 199^ for other works on 
this issue). We also stress that our robot has its own internal “motivations” 
and that the whole learning process is under their influence. This is very 
important because our system has to “behave” autonomously in an a priori 
unknown environment (animat approach [Meyer and Wilson, 19911 ). So we will 
call our robot an “animat” . Our control architecture is mainly composed of 
two parts. The first one enables the animat to locate itself in the environment 
[Gaussier et ah, 1997a[ . We will not discuss about this part in this paper. The 
second one builds a graph linking successively reached places (“cognitive map” 
[Tolman, 1948[ ). It is important to note that no external a priori information 
about the external world is given to the system. Learning new locations and 
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their links with each other is the core process of the system. In addition, there 
is also a reflex mechanism for obstacle following. This mechanism takes control 
of the animat when an obstacle is encountered. The created map is typical 
for a given environment. It allows the animat to smartly solve various action 
selection problems in a static world. But the map may be updated so that the 
same system may work if the environment has changed (door closed or moving 
person for instance) . The other dynamics followed by the system is given by the 
animat movements in the environment. These movements are either random 
when the animat is exploring, or given by the planning map when the animat 
has to reach a goal. Under the neural field formalism {dynamic approach to 
autonomous robotics |Sch5ner et ah, 1995| ), we deduce from the planning map 
a control variable for the animat speed (see [Goetz and Walters, 1997] for a 
review of control techniques). This enables a smooth approach of the goal. 

In the following, we first describe the experimental environment and the 
planning architecture. Then we show how the planning map may be interpreted 
in terms of a dynamic approach for speed control. Finally, we present the results 
about finding a path in a static and then in a changing world. 

2 Experiment 

The experiments reported in this paper are performed in a simulated environ- 
ment. However, some of them are already running on a real robot. The ani- 
mat only sees obstacles and landmarks. It may not see the resource locations. 
The environment may be as large as 40 animat sizes. Three different energy 
levels decrease exponentially over time unless re-supplied by a proper source 
[Gaussier et ah, 1998| . Since we are in an animat approach, these levels will 
be called “food”, “water” and “sleep”. They are linked with the motivations 
“hunger” , “thirst” and “rest” . A more robotic way of putting it in words would 
be to consider for instance a “charging station”, and two different “delivery 
points”, provided the robot gets rewards being at these locations. So, through 
random exploration, the animat has to discover “food” and “water” places where 
it may re-supply one of the two levels. From time to time, the animat has to go 
back to its nest for rest (it knows where it is at start). The animat has to go 
back to a previously discovered interesting place when it needs to. The animat 
must therefore solve the dilemma between exploring and reaching a known food 
and/or water place and/or rest. When on a source location, the animat source 
level increases by a given amount. When the source is empty, another one ap- 
pears randomly in the environment. Hence, the animat always has to explore the 
environment in order to find new potential sources. In addition, we impose that 
the levels of the essential variables are maintained between two thresholds that 
define the comfort area. Hence, as soon as the level of one of these variables is 
below the lower threshold, the animat goes toward a resource place. Gonversely, 
when the level of a particular essential variable is higher than a top threshold, 
the animat goes away from the resource place. The animat succeeds quite cor- 
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rectly in maintaining the 3 essential variables in the comfort area. A cycle of 
go and return takes place between the different types of sources. These cycles 
change only when the animat succeeds, during a random exploration phase, in 
finding a new resource place, or when a location is depleted. The animat can fail 
only when new sources appear in areas difficult to explore so that there is a low 
probability for a single animat to find the solution. In this case, there is a need 
to have a cooperation between several animats Gaussier et ah, 1997b|. 



3 Model 



We will not fully develop here our navigation model (see [Gaussier et ah, 1998| , 
IGaussier and Zrehen, 199^ , [Gaussier et al., 2000 for a complete presentation) . 
What should be reminded is that a location is defined by a set of landmarks and 
azimuth of each landmark. As the animat explores the environment, it learns 
new locations when they are different enough from the previous ones. A neuron 
is coding for each learned position in the environment. The closer the animat 
to the neuron coding for a position, the higher the response of that neuron. We 
don’t provide any external explicit description of the environment (world model), 
nor learn what to do for each location in the environment [Burgess et ah, 1994[ . 
In order to be able to go back to a learned location, the animat has to keep track 
of where it was and where it may go from any location. So we need to introduce 
a kind of map of the environment to be able to perform this task. To do so, 
we add a group of neurons able to learn the relationships between successively 
explored places. The temporal proximity being equivalent to a spatial proximity, 
the system creates a topological representation of its environment. We will call 
this last group our “cognitive map” (or goal map) (fig. [^ and [Tj) [Tolman, 1948[ , 
[Thinus-Blanc, 1996] , [Revel et al., 1998] . 

Let Wij be the weight associated with the fact that from the place i it is pos- 
sible to reach directly place j. At the goal level (fig.[T]), the motivations activate 
directly the goal neurons linked with the goal to be achieved (the goal neuron 
intensity is proportional to the motivation). This activity is then diffused from 
neuron to neuron through the synaptic weights. Other works also use a diffusion 
mechanism for planning paths [Connolly et al., 1990] , [Bugmann et al., 1995[ , 
[Bugmann, 1997| . Our goal diffusion algorithm is the following : 

— io is the goal neuron activated because of a particular motivation. Gi^ is its 
activity. 

Gjq ^ — 1 and, Gi i — 0 Vz yf zq 

— WHILE the network activity is not stabilized, DO: 'ij,Gj ^ max(Wy.Gi) 



where Gi is the value of neuron i (see Gaussier et al., 1998 for more details on 
the navigation algorithm). Wij has to be strictly inferior to 1 (0 < Wij < 1) to 
ensure the algorithm convergence. On figure [T] an example of activity diffusion 
from motivation “A” is presented. The algorithm allows to find the shortest 
path in the graph. The algorithm is proved to always find the shortest path in 



106 M. Quoy et al. 




Fig. 1. Global architecture of the planning system. The recognition level allows to 
identify the situations when the animat arrives in the vicinity of a stored place. These 
situations are directly linked to the goal level which allows to plan a route from one 
attractor to the next until the objective. When a motivation “A” activates a goal, a 
propagation of the information is performed in the direction of all nodes in the graph 
(here all weights are equal to 0.9). 



the graph (it is equivalent to the Bellman-Ford algorithm of graph exploration 
Pellman, 1958| , [Revel, 1997| ). 

The animat tries to follow the gradient of neuron activity in this cognitive 
map to select the next location to reach. The most activated goal or subgoal 
in the neighborhood of the current animat location is then selected and used 
to attract the animat in its vicinity. When the animat is close enough to 
this location, the associated subgoal is inhibited and the animat is attracted 
by the next subgoal and so on until it reaches the final goal. The princi- 
ple of this kind of cognitive maps is not new [Arbib and Lieblich, 1977 
[Schmajuk and T hieme, 1992| , 

[Schblkopf and Mallot, 1994|, 



Bacheld er and Waxman, 
[Bugmann et al., 1995| , 



1994 



Trullier et al., 1997 



pfranz et al., 199^ . The novelty in this paper is that our algorithm allows to 
solve planning problems involving several moving goals in a dynamic environ- 
ment (the sources may disappear and appear again elsewhere, obstacles may 
also be moved, landmarks may be hidden). The map is learned and modified 
on-line and allows to manage contradictory goals. The subgoals correspond to 
the following situations: the end of an obstacle avoidance (for instance, the 
animat stores the pathway between two rooms: location of the door) or places 
badly recognized (activity of all neurons below a defined recognition threshold. 
The higher this threshold is, the more learned places there are: denser coverage 
of the open space). We obtain a spatial “paving” of the environment which is 
almost regular (fig.|2I)- 

Learning of the cognitive map is performed continuously. There is no separa- 
tion between the learning and the test phases. The links between neurons of the 
graph are reinforced (hebbian associative learning) for neurons associated with 
successively recognized places. The learning rule is the following: 

= -A.w,,, + (C+ ).(1 - w ,^,). g ;. g , 



( 1 ) 
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Fig. 2. On the left, reflex behavior of obstacle following for reaching the goal (Initial 
speed towards the upper left. The animat has a mass giving a momentum (see section 
4)). The dots indicate the trajectory followed by the animat. On the right, “cognitive 
map” built by exploration of the same environment. The landmarks are the crosses 
on the border. Each circle is a subgoal. The links indicate that the two subgoals have 
been activated in succession. The subgoals and the learned transitions form the goal 
or cognitive map. 



Gi must be held to a non null value until Gj (with i ^ j) is activated by the 
recognition of the place cell j. This is performed by a time integration of the 
Gi values represented in the equation by Gi. Gi decreases with time and can 
be used as a raw measure of the distance between i and j. A is a very low 
positive value. It allows forgetting of unused links. C = 1 in our simulations. 
The term ^ corresponds to the variation of an external reinforcement signal 
(negative or positive) that appears when the animat enters or leaves a “difficult” 
or “dangerous” area. 

4 Use of the Planning Map for Speed Control 

The planning map is constructed in order to indicate which path to take to 
reach the goal. In addition, the map may give us another information: how to 
control the speed of the animat in order to assure to go to the goal and stop 
on it. Hence the planning map not only indicates where the goal is, but also 
enables to control how to reach it. For this purpose, we have to consider that 
the animat has a mass M. At the same time, we also add friction F. To balance 
this effect, an internal drive is also present. When the animat is exploring, this 
drive is random in strength and direction. This enables to randomly explore 
large areas. When it is goal seeking, the animat is going from one subgoal to 
another until reaching the goal. This is performed by following the gradient of the 
motivations on each subgoal. So when the animat is going towards the nearest 
goal (or sub-goal), the direction of the drive is given by the direction of the goal, 
and the strength may be either independent from the distance to the final goal 
(hence constant), or depend on this distance (hence depend on the gradient). 
The equations driving the animat movement are now (we took the same form as 
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in [Schmajuk and Blair, 199^): 



x{t + 1) = x{t) + dt'vx{t) 



( 2 ) 



where the x subscript denotes a projection on the x axis. The same equations 
hold for the y axis, v is the speed and (x,y) the position of the animat. F = 5, 
M = 5 and drive a force (either random or towards the goal). The values of 
these constants have been chosen in order to exhibit nice “gliding” trajectories 
illustrating the effect of the external drive in this equation, dt and dt' are small 
integration constants. In order to stay at the same position (when a goal is 
reached for instance), the speed must be 0. Hence drive = 0 at the goal location. 

So when the drive is constant, the speed will tend to and the animat can 

not stop on the goal. If the drive depends on the distance to the goal, then one 
has to make drive tend to 0 as the goal is nearer. In the case where the drive is 
not constant, it is the difference between the value of the nearest subgoal and the 
value of the second nearest subgoal which gives the strength of the drive. This 
value is computed during the diffusion of the motivation from the goal to the 
subgoals. Using equation|3] this value is G„ = lU", where W is the (fixed) weight 
between the subgoals and n the distance (expressed in number of subgoals) to 
the goal. Figure [3] shows the strength of the subgoal G„ versus the distance to 
the goal (dotted line). 

As one can see, as the animat comes close to the goal, the gradient increases. 
Hence as the animat approaches the goal, the drive towards it increases. This 
leads to a non null speed when arriving on the goal and an unstable reaching 
behavior (dotted line in fig. 0]). In order to avoid this unstable behavior (or to 
stop on the goal), the speed must decrease as the animat comes closer to the 
goal. So the activation value used to compute the gradient is now: 



Fin = exp - 



2.^2 



log{Gn) - log{W) 
= exp 5 



( 3 ) 



cp where a is chosen to be 5. 

The speed near the goal is now reduced, enabling a smooth approach (fig. 
EJ. This is particularly suitable if the animat has to stop on the goal. 

Comparing the two drive politics, it appears that the first one enables to 
reach the goal sooner, with a high speed when arriving on the goal. The goal is 
reached faster because its attraction strength is high, so that the animat does 
not “glide” as much as in the second case. However, there may be also cases 
where the goal is not reached at all (see below). In the second case, the animat 
may take a longer time to reach the goal, but arrives there with a very low speed. 
However, due to the smaller strength of the attraction it may sometimes “circle 
around” the goal until it is reached (see fig. El). 

These results are backed by the theoretical framework of dynamical systems. 
In particular, the way the map is constructed fits very nicely in the “neural 
field” framework [Amari, 1977 , [Schoner et al., 1995|. Indeed, we can extend 
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Fig. 3. Hn versus n (solid line), and Gn (dotted line). In the first case, the gradient de- 
creases as the goal is being reached. Whereas in the second case, the gradient increases, 
leading to a high speed near the goal. The two figures below show left the value of the 
gradient in the environment when the goal is in the middle of the room. The arrows 
indicate the direction of the gradient and its strength (left gradient of Hn and right 
gradient of G„). 




Fig. 4. Left: Trajectory of an animat driven by the gradient computed using Hn (solid 
line) and Gn (dotted line). The dots are the learned subgoals. The planning map links 
are not shown. Right: The same conventions are used for displaying the speed versus 
time in the two same cases. Speed is higher in the second case than in the first one. 



the notion of distance expressed in number of subgoals by making it continuous 
(“behavioral dimension” [Schbner et ah, 1995| ). So, one can now consider that 
the goal has an activation field extending over each subgoal. The value of the 
field at subgoal n is Hn (fig.[3|). The same way it is possible to control the head- 
ing direction of an animat |Bicho and Schbner, 1997] by controlling its angular 
velocity, we control the position of the animat through modifications of its speed 
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Fig. 5. Trajectory of an animat driven by the gradient computed using (solid line) 
and Gn (dotted line). The dots are the learned subgoals. The planning map links are 
shown. The animat reaches the goal sooner in the second case, but with a high speed. 



introduced by the gradient of i?„. In this case the goal position is a stable fixed 
point of the dynamics n = F{n) (fig. El). When n = 0, the animat is on the goal 
(and its speed is 0 either). Moreover, as it is a stable fixed point, we are sure 
that the animat will converge on it (in asymptotical time theoretically). This 
is not the case when using a gradient computed with G„. The animat may not 
converge on the goal at all because the goal is not a fixed point anymore (fig.Ej). 



gradient 




Fig. 6. Gradient of Hn (solid line) versus n and gradient of (dotted line) versus n. 
In the first case, n = 0 is a stable fixed point for the dynamics h = F{n) = In 
the second case, this not the case. 



5 Use of the Planning Map for Finding Paths 

We will now show in the next subsections that the animat is able to learn to 
choose between goals and that the planning map may be used in a changing 
environment. 

5.1 Learning Paths in a Fixed Environment 

To allow a measurement of the performances, the environment is chosen in the 
following experiments as a T-maze. However, this maze is not discretized into 
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squares (other than for statistical results, see below), but allows a continuous 
computation of the animat positions. The width of a corridor is 7 animat sizes. 
The need to go to the nest increases twice as fast as the need for food or water. 
Hence the animat has to go back very often to its nest. In order to suppress the 
possible biases introduced by an autonomous map building, the cognitive map 
is learned during a teleoperated exploration of the maze. The operator driving 
the animat follows the middle of the corridors. The maze and the cognitive map 
are displayed as the first figure of each experiment shown thereafter. In order 
to record the preferred paths taken by the animat, we have divided the maze 
into squares. Each square has the size of an animat. However note that these 
squares are not used in the computation of the movement. They are only used in 
this statistical analysis. The different figures show histograms of the occupation 
of each square in the maze. Histograms are computed from the average of 50 
different runs for each experiment. One run corresponds to 20000 iterates of the 
animat behavior in the same T-maze. The animat needs less than 2000 iterates 
to construct the complete planning map. 

Because the activation level of a particular subgoal is the maximum of the 
back-propagated motivational information, we believed at first that our algo- 
rithm was unable to choose correctly between satisfying one motivation or sev- 
eral simultaneous motivations. For instance, we thought it would be unable to 
always choose the left arm of the T maze fig. [TT] that allows to satisfy at the 
same time 2 motivations (something that the “free flow” architecture of Tyrrell 
[Tyrrell, 1993| succeeds in doing at the price of local minima problems and a 
priori simplification in the graph). Of course, it is possible to decide that if a 
goal is associated with 2 motivations those two motivations are summed before 
being diffused with our max rule. Unfortunately, if the two interesting sources 
are not exactly at the same place but are in neighbor areas, this trick cannot be 
used. If we change the max rule into a simple addition rule we can face deadlock 
situations because the diffused activity can amplify itself in loop situations or 
when a node receives the diffusion of a lot of other nodes. So we have tried our 
algorithm as it is and we have verified it was nevertheless able to find the good 
solution after a while because of on line weight adaptation. 

Figure 13 shows a symmetrical maze. What happens in this case is that the 
left or the right trajectory is randomly reinforced. Hence the animat goes either 
to the right arm or to the left arm to drink. But if during a random exploration, 
it goes in the other arm, it reinforces this path, and therefore enables its use 
again. Hence, the animat uses only one path until it goes in the other arm and 
then may use only the other path, and so on. 

Figure El and O show what happens when one of the arms is extended. As 
expected, the only water source used after a while is the one in the right arm. 
The exploration of the left arm does not modify enough the weights between 
the subgoals so as to make the animat choose this water source for drinking. 
Instead, as the animat goes more often near the water source in the right arm, 
these links are reinforced. 



112 



M. Quoy et al. 




Fig. 7. The figure on the left displays the initial maze, with the nest on the bottom, 
water sources in the left and the right arm and a food source at the top. In the figure on 
the right, the height of each square shows the number of time it has been occupied by 
the animat (dark (resp. light) color indicates low (resp. high) occupation. The number 
of iterations is 20000. The values have been computed adding 50 different runs. Here, 
as the maze is symmetrical, no arm is preferred for going to drink. 



Fig. 8. Same setup as in the previous experiment. The left arm is extended, so that the 
left water source is farther away from the nest and from the right water source than 
in Figured Hence, statistically, the animat prefers to go drinking in the right water 
source. 

In the last two experiments (fig. [inland [HJ, the arm where the food source is, 
is shifted to the left. The preferred water source is the one near the food source. 
When the animat goes eating, it may afterwards explore the maze. Since it is 
in the left arm, the animat is more likely to go in the left end. Hence the links 
between the food source and the left water source are reinforced. Only from time 
to time when the animat has enough time to go to the right end are the links to 
the right water source reinforced. 

The difference between the experiments shown in figures (B1 El CHI and dH is 
the shifting to the left of the upper arm containing the food. This move enables 
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Fig. 9. Same setup as in the previous experiment. The difference with figure [8] is the 
farther location of the water source in the left arm in this case. As this source is moved 
away to the left, it gets less used by the animat for drinking. 




Fig. 10. Same setup as in the previous experiment. The animat visits the right source 
more often than the left one. 



the animat to use the left arm water source more often. And in the extreme 
(last) case, it is now the left water source which is a lot more often used than 
the right one. This shows the importance of the reinforcement of used links (and 
conversely the decrease of the unused links) . This property allows the animat to 
“behave smartly” by not wasting its energy going to far away sources. 

5.2 Learning New Paths in a Changing Environment: 

Preliminary Results 

A real environment does not always stay the same. New obstacles may appear 
or disappear (somebody walking or a door opened or closed, for instance). The 
change in the environment may first affect the recognition of a place. But this 
is not the case in our system. Indeed, it stands up the loss (or the addition) of 
landmarks provided at least half of the landmarks used for learning a location 
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Fig. 11. Same setup as in the previous experiment. The water source on the left, near 
the food source, is more often used than the water source in the right arm. 



are still visible. However, a changing environment may dramatically impair the 
relevance of a learned cognitive map: some paths may not be used anymore (door 
closed), some others may be used (door opened). If an obstacle suddenly appears, 
the reflex mechanism allowing to follow the obstacle enables to nevertheless 
reach the goal (see figure [2D . On the contrary, new paths are found by random 
exploration of the environment when it is not goal seeking. We have tested our 
algorithm as it is in a changing environment (see figure IT21 . In this environment, 
the door opening enabling a direct path may be closed. The animat has learned 
a cognitive map with the door open. When there are two ways to go from one 
source to the other one, the direct path is preferred, but the other one is also 
used (left figure). In the second experiment, the direct path is closed by a door. 
The animat now goes through the other path. 

What makes it possible to change the environment is that the map is con- 
stantly updated either by creation of new subgoals (or goals) or by increase 
(resp. decrease) of the value of used (resp. unused) links. We have shown in the 
previous section how the reinforcement of some links may lead to the emergence 
of preferred paths. Now, it is the decrease of the link values which will help “for- 
get” unusable paths. For instance, if the animat has learned a path between two 
walls and this aperture is suddenly closed (by a door for instance) , the cognitive 
map is not valid anymore. What happens now is a decrease to 0 of the link 
going “through the wall” . The principle for the degeneration of a link is that it 
is not reinforced anymore because its two ends are not activated in succession 
anymore. So, the link decreases to 0 due to the passive decay term in equation 
[T] The 3 typical cases one may encounter when a path may not be used anymore 
are summarized in figure 1131 In the first two cases, one of the ends of the link 
leading through the wall may not be activated at all. Hence this link degenerates. 
The only problematic case is the last one presented. Indeed, there, the place cell 
in the wall may fire either when the animat is on the right or from the left side. 
Hence the left link, as well as the right link may be reinforced. The solution is 
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Fig. 12. Same setup as in the previous experiment. In the first experiment (left 2 
figures), the animat may go from one source to the other through either door. In the 
second experiment, the top door is closed. The animat is able to change its behavior 
and goes through the last open door. 



a) 



b) 



c) 



Fig. 13. A door (light hatching) has closed the pathway between two walls. There are 
3 different cases depending on where the subgoals were learned when the door was 
open. In case a), there is no subgoal “in” the door. Since the 2 subgoals on either side 
of the door can not be activated successively, the link between them degenerates. In 
case b), for the same reason the links between the subgoals in the middle and the ones 
outside may not be reinforced. In case c), however the links may be reinforced. 



the creation of a new subgoal where the animat hits the door. The situation will 
then be the same as case b) . The creation of this new subgoal (in planning mode 
specifically, in exploration mode there is no need to create a new subgoal when 
hitting an obstacle) has not been implemented in the system for the moment. 
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It is worthwhile to note that after the closure of the door the environment 
has changed: the landmarks of the other room are not visible anymore. Hence 
the situation which has led to the creation of a subgoal near the door opening 
is not valid anymore either. In some cases, these subgoals may not be activated 
at all (even if the animat is exactly on them) because other subgoals, which 
correspond to the new environment, have been learned. Indeed, changing the 
environment also changes the set of locations for which a place cell responses. 
New place cells are learned corresponding to the new landmark configuration. 
Their attraction basin extends over the previously created ones. Hence, in 
the extreme case, a place cell learned in the old environment may not fire 
at all in the new one, and so no new links may be created with it. However, 
the old ones still exist (even if their value is close to 0), but may not be 
reinforced anymore. If the old environment is presented again, these links may 
be reactivated. The animat must first discover that the path is usable again, 
and use it for some time before the links values reach a high level again. Now, 
the same degeneration process happens for the links between place cells created 
in the second environment. Hence several different “layers” of cognitive maps 
can appear in the same physical N.N. structure. They may be linked together 
through some place cells valid in more than one environment, and may be 
activated when the environment they are coding is presented again. The animat 
has to try some time before building a new efficient planning map. So, even if 
it may be a good solution to use a very low passive decay (A parameter) to 
store several different maps (memory effect), it also slowers down the process of 
finding new pathways, when one may not be used anymore. Indeed, the hebbian 
learning rule we have chosen needs some time before significantly changing the 
weights. Hence, in order to react faster to a change in the environment, it would 
be necessary to introduce an active decay mechanism decreasing unused links. 



6 Discussion 

The navigation and planning system we have presented is able to solve complex 
action selection problems. However, for the moment, it has to really perform the 
movements in order to reinforce particular paths. An improvement would be the 
possibility to internally replay the trajectory used to reach a goal. 

The main drawback of the algorithm is the computation of the gradient of 
Gn- Indeed, in very large environments (with a great number of subgoals) the 
gradient may be very small. Hence, first, the drive on the speed would be very 
small too (leading to a long time before reaching the goal). Second, if there is 
a small noise on the gradient, it would be now impossible to follow it. So there 
is a need to combine several maps of different joint environments. This means 
we have to define a planning structure, or plans of plans, in order to address 
large scale environments. This need is also highlighted by the use of the same 
neural structure for storing all maps linked with different environments. There 
will soon be an explosion of the number of neurons needed to code all locations. 
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and moreover a mix up between maps coding for totally different environments. 
So there is a need to “transfer” the map in another structure where it may be 
memorized. 

Concerning the interpretation of the planning map in terms of activation 
field, it is important to stress that this process is a top down one. Indeed, 
the map has first been build using the place recognition mechanism. Now, 
in return the use of the map gives information for the low level control of 
the animat. Moreover, this activation field approach could be generalized 
using the neural field paradigm |Amari, 1977| . Each goal can have a previous 
defined activation. The activations may be combined using the neural field 
equations. The advantage is that the animat will stay focused on a goal until 
either another goal becomes stronger or an external events occurs. This is also 
very interesting for us because it is easy to introduce in our neural architec- 
ture. Following a moving object is already coded that way Gaussier et ah, 19^ . 



This work was supported by two French GIS contracts on Cognitive Sciences enti- 
tled “comparison of control architectures for the problem of action selection” in collabo- 
ration with the Animat lab (J.Y. Donnart, A. Guillot, J.A. Meyer), LISC (G. Deffuant) 
and RFIA (F. Alexandre, H. Frezza), and “Mobile robots as simulation tools for the dy- 
namics of neurobiological models: adaptation and learning for visual navigation tasks” 
in collaboration with the CRNC (G. Schoner) and the LAAS (R. Chatila). 
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Abstract. In this paper two computationally efficient methods for build- 
ing a map of the occupancy of a space based on measurements from a 
ring of ultrasonic sensors are presented. The hrst is a method based on 
building a histogram of the occurrence of free and occupied space. The 
second is based on the calculation of the rate between occupied space 
measurements with respect to the total. The resulting occupancy maps 
have been compared with those obtained with other well-known methods, 
both count as well as Bayes-rule-based ones, in static environments. Free 
space, occupied space and unknown labels were also compared subse- 
quent to the application of a simple segmentation algorithm. The results 
obtained gave rise to statistically significant differences between all the 
different types on comparing the resulting maps. In the case of comparing 
occupancy labels, no differences were found between the following pairs 
of methods: RATE and SUM (p — value = 0.157), ELFES and RATE 
(p — value = 0.600) and ELFES and SUM (p — value = 0.593). 



1 Introduction 

Two of the basic aims of sensorial perception in navigation tasks are the detection 
of free space and the construction of maps of the environment. For these two 
tasks ultrasound (US) sensors have been widely used, due to their cheapness, 
wide range and low computational cost. Nevertheless, the obtention of a sensor- 
based environment map is a complicated problem due to the scarce and relatively 
imprecise information that is obtained from it. 

One of the most common approaches to map building using US sensors, 
and one that has been implemented in real, operative robots, is the setting up 
of an occupancy grid. Its main advantages are simplicity, low computational 
cost and human readability. Some of the most widely used approaches to the 
theme are probabilistic techniques based on Bayes theorem [Elfes, 1989a] [Elfes, 
1989b] [Konolige, 1996] [Thrun, 1998], although there are other techniques that 
are based on fuzzy logic [Gambino, Oriolo and Ulivi, 1996] [Oriolo, Ulivi and 
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Fig. 1. NOMAD 200 



Vendittelli, 1998], the theory of evidence [Pagac, Nebot and Durrant- Whyte, 
1998] [Duckett and Nehmzow, 1998] and count methods [Borenstein and Koren, 
1991]. Probabilistic tools revolve around 2 fundamental aspects: 

a) Assignation of an occupancy probability to each cell based on a single 
measurement. 

b) Probability accumulation across readings 

Nevertheless, in the probability calculation, terms that are difficult to eval- 
uate appear, which are directly dependent on the environment, and which have 
to be assigned in an arbitrary manner. On the other hand, the map updating 
process can become computationally expensive. Another common criticism of 
these techniques is based on the fact that the number assigned by the occu- 
pancy probability has neither sufficient quality nor quantity of information on 
the occupation of a cell. 

Based on these considerations, we have developed two algorithms that are 
very efficient from a computational perspective, and that are based on count 
methods, where we deal with the number of times that a free or occupied space 
has been detected in stead of the probability of occupancy. The count idea is 
not a new one; for example in [Borenstein and Koren, 1991] it was used for 
the real-time construction of an obstacle map, which later uses an algorithm to 
avoid obstacles. In spite of not being suitable for our free-space detection needs, 
a variant of this method adapted to our purpose was used. 

All the previously mentioned map construction techniques have been imple- 
mented on a NOMAD 200 robot (Fig. 1) and an experiment has been carried 
out in order to compare the maps obtained under the different processes. 
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The remainder of this paper is structured in the following manner: firstly, the 
different map construction methods are presented; then the experimental design 
used for the comparison of the different techniques are explained, and the results 
obtained are given, and finally, the results obtained are discussed. 



1.1 Related Work 

The need for an environment representation has given rise to several designs 
of map building methods for navigation purposes, starting with those that use 
the environment itself as its own representation, hence only the current sensor 
readings are considered in order to drive the current actuator commands, as in 
Brooks’ subsumption architecture [Brooks, 1986]. 

Other authors use the sensor-input history to build an internal environment 
map, which is used by navigation algorithms to generate the motor outputs, 
thus the current behaviour of the robot is driven not only by the current sensor 
input, but by all the inputs in the input history. In this context, Gasos and 
Martin [Gasos and Martin, 1997] generated maps that consisted of a fuzzy geo- 
metric primitives database that was updated with the new sensor US readings. 
Kuypers and Byun [Kuipers and Byun, 1991] used the US sensor readings to 
generate a hierarchical map with several description levels (control level, geo- 
metric level and topological level) in which the central part is the topological 
network description. The arcs in the topological map are assigned to a control 
behaviour and geometric information can be added to all the components of the 
topological graph. Nehmzow and Smithers [Nehmzow and Smithers, 1991] used 
self-organized networks to distinguish space locations. The input vectors to the 
network were obtained from both sensor measurements and motor commands. 
Koenig and Simmons [Koenig and Simmons, 1998] used Markov models to repre- 
sent the perception process. Yamauchi and Beer [Yamauchi and Beer, 1996] used 
a method that built a so-called Adaptive Place Network, in which a graph-based 
topological map is combined with metric information. In this context, occupancy 
grids are used to recalibrate dead reckoning. 

Aside from probabilistic and histogramic methods, other occupancy grid 
map building techniques were also implemented, including those based on fuzzy 
logic [Gambino, Oriolo and Ulivi, 1996] [Oriolo, Ulivi and Vendittelli, 1998] and 
the theory of evidence [Pagac, Nebot and Durrant- Whyte, 1998] [Duckett and 
Nehmzow, 1998]. 

2 Bayesian Models 



There are two fundamental aspects in probabilistic occupancy grid construction 
methods [Elfes, 1989a] [Konolige, 1996] [Thrun, 1998]: the sensor model, i.e. the 
assignation of the occupancy probabilities of each cell within the range of a sensor 
for a single sensorial measurement, and the accumulation of these individual 
probabilities from multiple measurements. 
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The most common probability assignation procedure takes the form of a 
Gaussian sensor model, in which terms that modify the Gauss distribution ac- 
cording to the distance and angular position with respect to the axis of the sonar 
beam appear. Thus a typical expression would be: 



p{D I r,e) 



2TrS{r)a 



( 1 ) 



where p{D \ r, 6) is the probability density function for the event “obtain a 
measurement D the objective being situated at a distance r and form- 
ing an angle 9 with the axis of the sonar beam”. Here, o;(r) and 5{r) 
represent factors that modulate normal distribution with distance, and a rep- 
resents the typical deviation for the angular deviation (values in the region of 
11°) [Konolige, 1996]. Other authors [Thrun, 1998] use neural networks for the 
initial assignation of probabilities. 

Another characteristic of these methods is the manner of representing uncer- 
tainty on an arc of the sector in which the echo is generated. The introduction 

of the factor in (1) or of the factor (1 — in other cases, aims to correct 

this uncertainty by assigning higher probability values to the central parts of the 
arc, as opposed to the lateral ones. Nevertheless, the introduction of these fac- 
tors is heuristic, and there are a multitude of situations in which this uncertainty 
cannot be modelled by means of this procedure. 

The updating of the probability assigned to a cell is carried out by means of 
the Bayes rule (2) 



P{A 1 B) 



P{B 1 A)P{A) 

iW) 



(2) 



where A and B are two events, P(A) is the probability of A, P(B) is the proba- 
bility B, P{B \ A) represent the probability of B conditioned to A and P{A \ B) 
represent the probability of A conditioned to B. If we represent A as the event 
C = {the cell C is occupied}, and B as the event [X = D] ={an echo has been 
obtained at a distance D|, then (2) will lead to 



P(G 1 [X=D]) 



p([X=D] 1 G)P(G) 

p([X=D] 1 G)P(G)+p([X=D] 1 G)P(G) 



(3) 



where the term p{\X = D] \ C) would be represented by (1), P{C) would 
represent the probability prior to the current reading, and P(C \ [X = D\) is 
the posterior occupancy probability. 

In other references [Konolige, 1996], the Gaussian curve (1) is substituted 
by curves taking the form y{9,r) = (1 — |j 2 )(l — ^), but the Bayesian basis is 
maintained. One difficulty with regard to these methods lies in finding a suitable 
distribution for the term p{[X = D] \ C). This represents the probability of 
obtaining a reading of occupancy when the cell is empty. An exact calculus 
involves all the possible states of all the cells in the map, so computations over 
the 2^ possible environment states, where N is the number of cells in the map, 
are necessary. 
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2.1 ELFES Method 

Two Bayesian methods found in the literature were implemented: ELFES [Elfes, 
1989a] [Elfes, 1989b] and MURIEL [Konolige, 1996], both of them particular 
cases of the general Bayesian technique described above, with slight differences 
in sensor models (1) and probability accumulation (3). 

In ELFES [Elfes, 1989a] [Elfes, 1989b] the terms o;(r) and <5(r) in (1) are 
constant and heuristically set to 1 and 0.02, respectively, which means that no 
distance dependency is assumed. In the probability accumulation procedure it 
is assumed that P{[X = D] \ C) = 1 — P([X = D] \ C), which means that 
the probability of obtaining a value of occupancy at a distance D, the cell being 
empty, is assumed to be equal to the probability of not obtaining a value of 
occupancy at a distance D, the cell being busy. 

Probability values below 0.1 were set to 0.1 and probability values above 
0.9 to 0.9, with the aim of avoiding not being able to change non-occupancy 
and occupancy certainties (0 and 1) that are inherent in the Bayes rule (if the 
probability of the occupancy of a cell is 1 or 0, then this value will be permanent 
as a consequence of the expression (2)). This allow this method to respond to 
changes in the environment, or to fix sensor misreadings. 

To ilustrate how this method works, we will consider the occupancy values 
of a given cell before and after the sensor reading, and examine two particular 
cases: 

i) The cell is between the echo and the robot, and further away from the 
echo. In this case the second term in the denominator of (3) dominates and 
P{C \ \X = D\) takes a value that is closer to 0 than the old one. 

ii) The cell is close to the echo. In this case p{\X = D] \ C) in (3) is higher 
than p{\X = D] \ C), and P{C \ \X = D\) will take a value that is closer to 1 
than the old one. 

2.2 MURIEL Method 

MURIEL [Konolige, 1996] is a sophisticated probabilistic method that includes 
independent sensor data collection, specular reflection Altering and Bayesian 
probability accumulation. Both independent data collection and specular reflec- 
tion Altering were developed under the static world hypothesis, and may have 
unwanted side effects in dynamic environments. As these two features are ad- 
ditional to the probability accumulation, and equivalent mechanisms can be 
proposed for the rest of the methods, a modified version in which these features 
were not present was implemented for comparation purposes. 

This method starts with a modified Gaussian sensor model in which the 
possibility of the existence of multiple targets is considered. The expression pos- 
tulated for the probability density of obtaining a measurement X=D given the 
cell occupied is 



a{ri) - 

\/^5{ri) 



1 

2 5(rj)2 




+ F 



p{[X = D]\C) 



(4) 
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being F a small constant representing a constant probability of specular re- 
flection. In this sensor model of this method (4), distance dependent terms, 
a{r) and 5(r), are heuristically introduced as a(r) = 0.6(1 — min(l, .25r)) and 
5(r) = 0.01 -I- 0.015r, and it is assumed that the term p{[X = D] \ C) in the 
probability accumulation procedure is the heuristic quantity F. 

Given the existence of multiple targets, and given the possibility of obtaining 
several echoes from one single sensor reading, the event that the detected echo 
is the nearest one is differentiated. The following notation is used 
[r > U] : No return less than D 
[r = D\ and [r > D] 

Under the hypothesis of independence between [r = D] and [r \> D] we obtain 
the following expression for the posterior likelihood 



where P{r \> D\Q) = 1 — p(r = x\Q)dxdd, with Q = C or C. An on-axis 

plot of the log likelihood is shown in Fig. 2. 

These individual likelihoods, measured for each cell, are assumed to be inde- 
pendent across readings, so the posterior odds for occupancy of a given cell 



A([r@D|C)] 



P{r@D\C) p{r = D\C)P{rt>D\C) 
P{r@D\C) ~ p{r = D\C)P{r>D\C) 



( 5 ) 




p{C\[X = D]) _ p{[X@D]\C) P{C) 
p{C\[X = D]) p{\[X@D]\C) P{C) 



X{[X@D]\C)0{C) 



( 6 ) 
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Fig. 2. On axis log likelihood for targets at 1, 2 and 3 m. 
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hence the log likelihood is accumulated using the recursive equation 

log{0{C\[x = D])) = log{X{[X = D]\C)) + log{0{C)) (7) 

Qualitatively, this works as follows: Let us suppose that we obtain a sensor 
reading X=D, and let us consider the posterior odds of a given cell. The posterior 
odds will be given by (7), and will be the addition of the prior odds and the 
reading log likelihood (Fig. 2). If the cell is located between the robot and the 
target, the log likelihood is then less than 0, so the posterior odds will decrease. 
If the cell is close to the target, the log likelihood will be higher than 1 and the 
posterior odds will increase. If the cell is after the target, the log likelihood will 
be close to 0 and, so the posterior odds will remain aproximately constant. 

3 Count Based Models 

In this category we include those methods that assign a degree of occupancy to 
each cell in the map, based on the number of times that it is detected as being 
occupied, and the number of times that it is detected as being empty. In the 
same manner as in the case of the Bayesian models, these methods comprise a 
sensor model and an accumulation method. 



3.1 SUM and RATE Methods 

In our methods we have constructed a very simple sensor model. Each time that 
a sonar echo is perceived we assign the value Vt[i,j] = — 1 to the cells situated 
between the transmitter and the possible obstacle, and the value Vt[i,j] = +1 
to the cells situated on the boundary sector of the sonar cone, independently of 
its position on it (Fig. 3). Cells above 3 meters from the centre of the robot are 
not updated. 

Possible speculative reflections were not considered, nor the distance of the 
ultrasonic source to the possible objective. We have tested sensor models that did 
deal with this last aspect, but these did not significantly improve the approach 
that we present here. Rewarding the centre of the segment to the detriment 
of the outermost points has also been tried (with the function 1 — ^), but no 
apparent improvement in the maps obtained was observed. 

In the SUM model the occupancy value of each cell C[i,j] of the map is 
obtained by algebraically adding the different observations 

Ci+i [h j] = Ct [i, j] + vt [z, j] (8) 

truncating to a maximum cell occupancy value of 127 and a minimum one of 
— 128. The initial value for all cells is 0, i.e., maximum uncertainty. 

This method has a very low computational cost, with a great deal of inertia 
in exchange. That is, if a cell has been shown to be “free of occupancy” over a 
series of n observations, at least n new observations in which it is shown to be 
“occupied” are needed for it to lose the character of “free of occupancy” and for 
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^ Occupied space 
[ I Uncertainty 
I I Free space 



Fig. 3. Sensor model used in SUM and RATE methods, showing the distribution of 
free and occupied space with regard to one ultrasound measurement. 



it to be shown with the characteristics of maximum uncertainty, and a further n 
observations, in which it must carry on appearing as being occupied, are needed 
in order for it to classed as “occupied with the degree of certainty n”. This is 
why the method is very efficient in determining static characteristics of maps, by 
filtering moving obstacles and errors. In any case, the maximum and minimum 
occupancy values are a design specification, and it will always be possible to 
adjust the inertia of the map. 

In the RATE method we maintain two matrices of values denominated 
Occ[i,j] and Visits[i, j], so that each time that the sensor model returns a 
value Vt[i,j] = —1 we will increase Visitst[i,j] and each time that Vt[i,j] = 1 is 
obtained we simultaneously increase Occt[i,j] and Visitst[i, j]. The occupancy 
value for each cell is obtained as the relative frequency 



Ctihj] 



Oect [i,j] 
Visitstli, j] 



(9) 



which can be considered as a probability. 

From a computational perspective, this method has a higher cost than the 
previous one, as it has to maintain two matrices per map in place of one. In 
exchange, compared with probabilistic methods, it has the advantage of not 
involving calculi with transcendent functions. 



3.2 HISTOGRAM Method 

This method is an adaptation of a technique proposed by Borenstein and Ke- 
ren [Borenstein and Keren, 1991] designed to detect obstacles online. The origi- 
nal technique increments one single cell in the grid by three units for each range 
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reading. The incremented cell is the one that corresponds to the measured dis- 
tance r and lies in the acoustic axis of the sensor (black cell in Fig. 4). The 
cells lying in the acoustic axis with distances less than the measured value are 
decremented one unit (white cells in Fig. 4). The maximum value accumulated 
in a cell is 15 and the minimum value is 0, so this method cannot distinguish be- 
tween free and unexplored space. We have implemented a modified method that 
maintains the original sensor model, but uses (8) as an accumulation procedure 
in order to provide free-space information. 




B Occupied space 



[ ] Uncertainty 
I I Free space 



Fig. 4. Sensor model used in HISTOGRAM method [Borenstein and Keren, 1991], 
showing the distribution of free and occupied space with regard to one ultrasound 
measurement. 



4 Experimental Results 

In order to construct the maps that are presented in this paper we have used the 
following methodology; 1800 data were collected from an excursion of the NO- 
MAD 200 robot in an indoor environment. During the excursion the robot was 
autonomously moved by a wall following agent (see [Iglesias, Regueiro, Correa 
and Barro, 1997] [Iglesias, Regueiro, Correa and Barro, 1998] for details) which 
followed the wall on the right, and travelled the length of the corridor only once. 
Each datum consisted of 16 distance measurements from an ultrasonic sensor 
ring, and their dead reckoning data associated, stored at a frequency of 8 Hz. 

The route started from a laboratory, crossed a 1.6 m. wide door and then 
followed a 2 m. wide corridor to the right, crossed a 2.7 m. x 5.2 m. hall, 
followed straight on for approximately 3 m. and then stopped. The centre of 
the laboratory was free, while there was several obstacles close to the walls. 
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The corridor had three doors on the left, close to the hall, and the hall had an 
obstacle (coffee machine) on the right, and a corridor and a door on the left. A 
detail of the route can be found in Fig. 5. The average speed was 0.1 m/s and 
the total distance covered was about 30 m. 

The different methods are applied to this set of data in order to obtain 
the occupancy values of the 180 x 230 square cells, 0.10 m. wide, in a regular 
grid. In order to visualize the maps, the occupancy values have been linearly 
transformed to a grey scale in the range of [0,255]. The dimensions of several 
features in the environment were measured in each map and compared with the 
actual values measured directly in the environment. The distances tested were 
the width and length of the corridor and the width and length of the coffee 
machine. All distances were obtained manually several times, and at different 
places, and mean value and standard deviation calculated. Table 1 shows the 
results of this comparation. 




Fig. 5. View of the corridor travelled along by the robot in the experiment. 



Figures 6 and 7 show the maps obtained with the two Bayesian methods 
implemented. As can be seen the definition of the walls is poor. 

Figure 8 shows the occupancy map obtained with the SUM method. The 
cleanness and definition are noteworthy, both in the free space as well as obsta- 
cles. The doors along the route can also be seen. 

The map obtained with the RATE method is shown in Fig. 9. This map 
is of a poorer quality than the previous one. Contrary to what happened in 
the former, the uncertainty error in the sector are very difficult to eliminate 
and they cast a shadow over the part of the corridor which really corresponds 
to free space. Comparing the two methods, which have the sensor model in 
common, the determining factor in order to decide on the occupancy of a cell 
was a “democratic vote” on “ how many in favour, how many against 
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Fig. 6. Map of the reference environment built according to the MURIEL method 
[Konolige, 1996]. 




Fig. 7. Map of the reference environment built according to the ELFES method [Elfes, 
1989a]. 



and ... the majority decides” in place of “we have this percentage of the 
vote in favour of the cell being occupied” 

The map in Fig. 10 was obtained by means of the HISTOGRAM method. 
This method detects obstacles well, and as far as free space is concerned, ap- 
parently it signals this more poorly than the previous methods. Nevertheless, 
we have verified that, in the same manner as the RATE method, this can be 
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Fig. 8. Occupancy map built according to the SUM method. 




Fig. 9. Occupancy map built according to the RATE method. 



avoided by simply considering those values under a certain threshold to be free 
spaces. None of the methods perform well in the case of the robot remaining idle 
for long periods of time. 

The results of the comparation of several environment dimensions measured 
in each map versus the actual distances is shown in Table 1, in which it is shown 
that all the distances measured lie within the measurement error range. 
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Fig. 10. Occupancy map built according to the HISTOGRAM method (see text for 
details). 

Table 1. Comparation between real environment dimensions and map dimensions. 





Actnal dist 


SUM 


RELATIVE 


ELFES 


MURIEL 


HISTO 


corridor width 


2.00 m 


1.94T0.07 


2.01T0.05 


1.92i0.08 


1.66i0.13 


1.97i0.07 


corridor length 


13.30 m 


13.2i0.ll 


13.2i0.10 


13.16i0.07 


13.li0.ll 


13.27i0.04 


obstacule width 


0.68 m 


0.74i0.11 


0.73i0.38 


0.75i0.3 


0.92i0.24 


0.62i0.11 


obstacule length 


0.58 m 


0.72T0.14 


0.65i0.07 


0.57i0.06 


0.79i0.26 


0.55i0.11 



With the aim of emphasizing obstacle detection in the SUM method we have 
strengthened the occupancy value establishing it at +3, obtaining the map in 
Fig. 11. 



5 Statistical Analysis 

The maps obtained with the five different treatments are, at first view, very 
different, as can be seen in the figures. These differences were statistically con- 
firmed in the following manner. From each map the value of each cell of the 
map linearly transformed to the range [—128,127] are stored. The data from 
each data set do not follow a known distribution, and for this reason any test to 
be used for comparing them must be non-parametric. Moreover, all maps were 
generated from the same set of sensorial data, due to which the statistical test 
has to take into account that the data set are related. 

Bearing this in mind, we decided to apply the non-parametric Friedman test 
for 5 related treatments [Mendenhall, Scheaffer and Wackerly, 1986] to the five 





Probabilistic and Count Methods in Map Building 



133 




Fig. 11. Occnpancy map built according to the SUM method, assigning Vt[i,j] = +3, 
for obstacles. 



data sets obtained. In this test the aim is to prove that the five maps are identical 
(null hypothesis or Hq) as opposed to the hypothesis that at least two of the 
maps are different (alternative hypothesis or Ha). The result was that there were 
significant differences {p — value < 0.001). 

With equality between the set of five maps ruled out, we went on to search for 
similarities between the ten possible pairs formed on the basis of the five maps. 
In order to do this the Wilcoxon test for two related samples was applied. Table 
2 shows all the p — values obtained together with their statistical verification 
(Z). As can be seen, the result was also significant {p— value < 0.001) in all these 
cases. Thus, even though we were working with the same set of data obtained by 
the ultrasonic sensors, the results of the different methods cannot be considered 
to be similar. 

Nevertheless, as a fundamental objective of map construction is the differ- 
entiation between free and occupied space, we have proceeded to carry out a 
second statistical test, aiming to decide with this whether all or some of the 
methods show this characteristic in the same manner. For this, firstly, the map 
data were transformed in the following way: each positive value in the new file 
is assigned the value -1-1 {occupied), each negative value is assigned the value -1 
{free) and the null values {unknown) have been maintained. 

Statistical analysis for these segmented maps (table 3) throws up significant 
differences maps in all cases except SUM-RATE {p — value = 0.157), ELFES- 
RATE {p — value = 0.600) and ELFES-SUM {p— value = 0.593). That is to say 
that with the objective of labelling the space as free or occupied the given pairs 
of methods are similar. 





134 



M. Rodriguez et al. 



Table 2. Verification statistics (Z) and p — value (in brackets) of the Wilcoxon test for 
two related samples, applied to the maps created with each of the five methods used. 



MURIEL 


- 19.031 

( 0 . 000 ) 








SUM 


- 9.778 

( 0 . 000 ) 


- 2.759 

( 0 . 006 ) 






RATE 


- 38.330 

( 0 . 000 ) 


- 29.788 

( 0 . 000 ) 


- 9.174 

( 0 . 000 ) 




HISTOGRAM 


- 61.236 

( 0 . 000 ) 


- 56.229 

( 0 . 000 ) 


- 77.317 

( 0 . 000 ) 


- 63.072 

( 0 . 000 ) 




ELFES 


MURIEL 


SUM 


RATE 



Table 3. Verification statistics (Z) and p — value (in brackets) of the Wilcoxon test for 
two related samples, applied to the map data after transformation into free, occupied 
and unknown spaces. 



MURIEL 


- 10.192 

( 0 . 000 ) 








SUM 


- 0.534 

(0.593) 


- 6.756 

( 0 . 006 ) 






RATE 


- 0.524 

(0.600) 


- 6.764 

( 0 . 000 ) 


- 1.414 

(0.157) 




HISTOGRAM 


- 31.520 

( 0 . 000 ) 


- 26.836 

( 0 . 000 ) 


- 33.588 

( 0 . 000 ) 


- 33.607 

( 0 . 000 ) 




ELFES 


MURIEL 


SUM 


RATE 



6 Discussion 

In this work we present two new count based methods for building occupancy 
maps. One of them, SUM, builds an histogram of the occurrence of free and 
occupied space. The other, RATE, is based on the calculation of the rate between 
the number of times one cell is detected as occupied space and the number of 
times a measurement is got for that cell. The main advantage of count based 
methods is their very low computational cost, so they can operate in real time 
on mobile robots where the computational resources are always limited. 

In order to test these algorithms, two well known probabilistic methods based 
on the Bayes rule, ELFES and a variant of Konolige’s MURIEL in which inde- 
pendent data collection and specular reflection Altering were not present, were 
implemented, as well as, one modified version of the histogramic algorithm. Vi- 
sually comparing all the maps with the real environment results that maps ob- 
tained using count methods lead to more realistic maps than probabilistic ones. 
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Among count methods, HISTOGRAM leads to worse results differentiating free 
space. The reason is the smaller amount of cells for which an occupancy value 
is obtained for each sensor reading. The comparation between the distances vi- 
sually measured in the maps and the actual dimensions shows that in all cases 
the actual dimension lies within the measurement error range. 

Aside from visual comparations, statistical results show that none of the five 
techniques is equal to any other in occupancy map building. However, if the 
cells are labelled as occupied, free or unknown, there are no significant differ- 
ences between SUM, ELFES and RATE methods. This is remarkable, since for 
navigation tasks the important point is to determine whether or not there is free 
space for moving the robot. For these tasks, it would be possible to substitute 
the computationally expensive ELFES method by the simpler and more efficient 
SUM or RATE. 

Count methods have to make strong assumptions in sensor model and accu- 
mulation methods, both stated without proof, but the effect of inaccuracies in 
these assumptions are expected to accumulate incoherently and tend to cancel 
each other out. In the same way, Bayesian methods are not free of such assump- 
tions, related to the practical difficulties in computing the term p{[X = D]\C) 
in the accumulation mechanism (3), the exact calculation of which would involve 
computation over all possible environment configurations. The sensor model in 
this case is also given with no proof. 

Nevertheless, all methods for building occupancy grids assume that occu- 
pancy values of all measurements are independent for all cells. The rationale is 
that, as the robot is mobile, occupancy measurements obtained from different 
positions generate errors that distribute randomly and tend to annul. This hy- 
pothesis is, however, frequently violated in practice, and it was observed that 
some configurations of trajectories and obstacles generate systematic errors that 
accumulate coherently, mainly, in walls and obstacles detected far away. This ef- 
fect shows up in different ways depending on the map building method employed, 
but it is always present. In order to avoid this behavior, it may be necessary to 
modify the models by considering the dependency between adjacent cells, and 
measurements taken from adjacent locations. 

There is an additional problem related to odometry accumulative errors. 
These errors lead to erroneous robot position readings in the long term, while 
they are quite reliable in short displacements. This problem has been addressed 
by different authors recalibrating the odometry with position estimations based 
on sensor readings [Thrun, 1998] [Duckett and Nehmzow, 1998]. Occupancy mea- 
surements are only possible inside the range covered by the sensors, hence occu- 
pancy measurements in a particular cell were taken from locations close to it. In 
this experiment the trajectory followed was uniform and passed each point only 
once, so we can assert that all occupancy measurements in a given cell came 
from positions reached by the robot close in time, and the odometry error is not 
expected to accumulate excessively. The odometric long term error effect shows 
up in all maps as a deviation from the straight line in the shape of the walls. 
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We are currently developing algorithms which use the occupancy grid and 
measurements from a laser range finder to detect natural landmarks (i.e., doors 
and corners), which can be used to recalibrate the odometry [Duckett and Nehm- 
zow, 1998] and to recognize regions. The occupancy grid can also be used to find 
which regions need to be explored or revisited in order to obtain more informa- 
tion about them. 
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Abstract. This paper presents a biologically-inspired method for se- 
lecting visual landmarks which are suitable for navigating within not 
pre-engineered environments. A landmark is a region of the goal image 
which is chosen according to its reliability measured through a phase 
called Turn Back and Look (TBL). This mimics the learning behavior of 
some social insects. The TBL phase affects the conservativeness of the 
vector field thus allowing us to compute the visual potential function 
which drives the navigation to the goal. Furthermore, the conservative- 
ness of the navigation vector held allows us to assess if the learning 
phase has produced good landmarks. The presence of a potential func- 
tion means that classical control theory principles based on the Lyapunov 
functions can be applied to assess the robustness of the navigation strat- 
egy. Results of experiments using a Nomad200 mobile robot and a color 
camera are presented throughout the paper. 



1 Introduction 

Animals, including insects, are proficient in navigating and, in general, several 
biological ways of solving navigational tasks seem to be promising for robotics 
applications. The different methods of navigating have been recently studied and 
categorized as [^: guidance, place recognition - triggered response, topological 
and metric navigations. In order to perform such tasks animals usually deal with 
identifiable objects in the environment called landmarks m 

The use of landmarks in robotics has been extensively studied pirr] . Basi- 
cally, a landmark needs to possess characteristics such as the stationarity, relia- 
bility in recognition, and uniqueness. These properties must be matched with the 
nature of a landmark: landmarks can be artificial or natural. Of course it is much 
easier to deal with artificial landmarks instead of dealing with natural ones, but 
the latter are more appealing because their use requires no engineering of the 
environment. However, a general method of dealing with natural landmarks still 
remains to be introduced. The main problem lies in the selection of the most 
suitable landmarks 
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Recently it has been discovered that wasps and bees perform specific flights 
during the first journey to a new place to learn color, shape and distance of 
landmarks. Such flights are termed Turn Back and Look (TBL) |^. Once the 
place has been recognized using landmarks, insects can then accomplish navi- 
gation actions accordingly. The Cartwright and Collett model [9] is one of the 
main methods of navigating |32| . 

The aim of this paper is to describe the learning system of a biologically- 
inspired navigation method based on natural visual landmarks. The introduced 
system will select natural landmarks from the environment adopting the TBL 
phase (section 13 . 

From the selected landmarks suitable navigation movements will be com- 
puted (section EJ and in Sect. |H the guidance principle and how this can be 
influenced by the TBL phase will be addressed. The measurable effects of TBL 
(namely, the conservativeness of the navigation vector field) is a way to assess 
the quality of the landmarks chosen by the learning phase. 

The conservativeness of the field also permits to compute the (unique) poten- 
tial function which drives landmark navigations and formal ways to assess the 
robustness of the whole approach can be introduced. In fact, the presence of a 
potential function around the goal is a sufficient condition for the application of 
the classical control theory based on Lyapunov functions. Tests performed with 
the Nomad200 will conclude the paper (section O . 

2 Learning Landmarks 

A landmark must be reliable for accomplishing a task and landmarks which 
appear to be appropriate for human beings are not necessarily appropriate for 
robots because of the completely different sensor apparatus and matching sys- 
tems pTTj . 

For example, the necessity of performing specific learning flights allows the 
insects to deal with objects which protrude from the background or which lie 
on a different plane than the background m- Attempts to understand in detail 
the significance of learning flights have been made only recently. Essentially, the 
flights are invariant in certain dynamic and geometric structures thus allowing 
the insects to artificially produce visual cues in specific areas of the eyes pTH] . 
Perhaps, the main reason is that the precision for the homing mostly depends 
upon the proximity of chosen landmarks to the goal m- In fact, those flights 
need to be repeated whenever some changes in the goal position occur [2H] . 
Therefore, it becomes crucial to understand whether or not a landmark is robust 
for the task accomplished by the agent. 

Following the biological background and reconsidering the results presented 
in |31J . one key point is that once the meaning of reliability has been established 
then the problem of selecting landmarks is automatically solved. Therefore, stat- 
ing what is meant by reliability of landmarks, once given the specific sensor and 
matching apparatus, is a mandatory step. 
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2.1 Sensor and Matching Apparatus 

The use of visual landmarks asks for real-time performances and this can lead 
to the use of specific hardware for their identification. The robot NomadSOO 
(figure [IJ that was used to accomplish the tests includes the Fujitsu Tracking 
Card (TRY) which performs real-time tracking of full color templates at a NTSC 
frame rate (30Hz). Basically, a template is a rectangular region of a frame which 
can be identified by two parameters rrix and ruy representing the sizes along X 
and Y axes respectively. 




Fig. 1. The Nomad200 



The card can simultaneously track many templates which have been pre- 
viously stored in a video RAM. For each stored template the card performs 
a match in a sub-area of the actual video frame adopting the block matching 
method. This introduces the concept of correlation between the template and a 
sub-area of the actual video frame. The correlation measure is given by the sum 
of the absolute differences between the values of the pixels. 

To track a template it is necessary to calculate the correlation between the 
template and a frame not only at one point on the frame but at a number 
of points within a searching area. The searching area is composed of 16 x 16 
positions in the frame. The whole set of computed correlation measures is known 
by the term correlation matrix. Examples of correlation matrices are reported in 
figure 12 To perform the tracking, the matching system proposes as an output 
the coordinates of the position which represents the global minimum in the 
correlation matrix. 

This approach strongly resembles the region-based optical flow techniques 
|2]2lj. There, the flow is defined as the shift that yields the best fit between the 
image regions at different times. 
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Fig. 2. Examples of correlation matrices which have been computed in the neighbor- 
hood of the templates (boxes in the pictures) 



2.2 Choosing the Best Landmarks 

Mori et al. have taken advantage of the correlation matrix to generate attention 
tokens from scenes |2^. The method they introduced is suitable to be used in 
the case of self-extraction of landmarks. They introduce a boolean measure to 
select a template: 



l-->5 

9 
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where 5 is a given threshold, g' is the (local) minimum value found in a circle 
(given a radius) around g (the global minimum) in the correlation matrix. Fig- 
ure 1^ exemplifies the method: g' is found in the gray area; g is the minimum 
value at the center of that gray area. 




Fig. 3. The neighborhood used in the Valley Method 



It is clear that this method (termed the Valley Method by the authors) is a 
crude approximation of a more sophisticated method that can explore the shape 
of the distortion matrix giving more information about the template and its 
neighborhood. However, the method works in real-time and for our applications 
this is crucial. 

Relaxing the valley method the following can be obtained: 



where the value r is a measure on how the template is uniquely identifiable in 
its neighborhood: the greater r the more uniquely identifiable the template in 
its neighborhood. Therefore, by reliable landmarks we mean templates which are 
uniquely identifiable. 

There are several degrees of freedom in searching for the best templates. Once 
having stated the definition of reliability, these degrees of freedom are represented 
by: the coordinates of the upper- left corner of the template being considered and 
the template size along X and Y . To avoid an extensive (and expensive) search 
in this 4-dimensional space we need to introduce some simplifications. 

First of all we will consider only squared templates thus reducing the dimen- 
sions of the searching space by 1. This is a very substantial limitation, as instead 
of dealing with 64 possibilities we reduce this number to 8 because only eight 
different values along X and Y are handled by TRV. Some opportunities to deal 
with reliable templates are probably missed. In any case, dealing with squared 



Biologically-Inspired Visual Landmark Learning for Mobile Robots 



143 



templates doesn’t undermine the whole method as we discovered performing the 
tests reported. 

The second chance to speed up the search for the best template is to simplify 
the search for the co-ordinates of its upper left corner. In fact, regardless of 
the size of a template, we should select among 640 x 480 templates (this is the 
resolution of the NTSC signal processed by TRV), by comparing their distortion 
matrixes: even for a real-time tracking card like ours this computation is too big. 

Therefore, for every template size m^y (hereafter, this refers to rrix = rriy) 
we introduce the grid visible in figure The grid has been designed following 
these principles: 

— every cross represents the upper left corner of a possible template whose size 
is m^y 

— there must be enough room to compute the distortion matrix for each tem- 
plate 

For these reasons we have different grids according to different rrixy The 
first (top-left) cross is located in (8 x rrixy,S x rrixy) so as to leave enough room 
for computing one half of the distortion matrix. The same applies to the last 
(bottom-right) cross. 



8*m„ 639-16*m,,y 



8*n\y 






479-16'"m„ 



Fig. 4. The grid introduced to limit the amount of computation in searching for land- 
marks 



As every cross represents the upper-left corner of a potential landmark and 
every template is 16 x 16 x in size, there is an overlap between two close 
templates, both along X and Y as shown in figure El 
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Fig. 5. Overlapping between neighboring templates 



From a computational point of view the maximum number of selectable tem- 
plates is (640 mod m^y) x (480 mod mxy)- Furthermore, a potential landmark 
is not independent of the neighboring ones (because of the overlapping). Once 
again, we should be aware that several combinations of templates are missed but 
considering the results of the performed tests this choice does not undermine the 
whole strategy. 

Another limitation we will introduce concerns the freedom of size. Instead of 
choosing templates which give the opportunity to search for the best combination 
of size and upper- left corner, we keep the size fixed to a given value, for example 
mxy = 5. 

In conclusion, in order to select the best templates (so they can be referred 
to as landmarks) we maximize the following: 

(o^;Op = arg max ri{ox,Oy) (2) 

(ox,Oy) G grid 
mxy fixed 

where ri{ox, Oy) identifies the reliability factor for a landmark I whose upper left 
corner is located in (ox,Oy). Therefore the position (o*,o*) with the highest r 
is returned. In order to assure that landmarks occupy different positions, previ- 
ously chosen coordinates are not considered. Examples of landmarks chosen are 
reported in figure El 

Obviously a landmark can have a different size and position from the original 
one. For example, when the robot is far from the goal place. Therefore, generally. 



Biologically-Inspired Visual Landmark Learning for Mobile Robots 



145 




Fig. 6. Different choices of landmarks for different template sizes 



the matching phase during the navigation task has to deal with the problem of 
guessing which is the actual size and position of a landmark. We can solve the 
problem using an extension of the classical block matching method implemented 
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in TRV. If we suppose that in the set composed by the eight measures of dis- 
tortion, the best value is represented by the minimum one (following the same 
principle that drives the block matching method), then we discover which is the 
actual best size. That is: 

m* = arg min D{ox,Oy,m^y) (3) 

^ l<mxy<S 

where D{ox,Oy,mxy) represents the correlation value measured when the upper 
left corner of the template is located in (ox,Oy) and its size is mxy 

Once the best landmarks have been chosen from the static image of the goal, 
then they are stored in an internal video RAM to be used for successive tasks, 
the first being the Turn Back and Look phase. 



2.3 The Turn Back and Look Phase 

The landmarks which have been statically chosen will be used for navigation 
tasks. We found that it was necessary to test them in order to verify whether 
they represent good guides for navigation tasks. 

TBL can help in verifying this by testing whether during the motion 

the statically chosen landmarks still remain robustly identifiable. 

Through a series of stereotype movements small perturbations (local lighting 
conditions, changes in camera heading, different perspectives and so on) can 
influence the reliability of the statically chosen landmarks. Referring to figure [ 7 ] 
the robot goes from the top to the bottom and the camera (arrows pointing to 
the top) is continously pointing towards the goal. 

These sorts of perturbations occur in typical robot journeys thus allowing us 
to state that the TBL phase represents a testing framework for landmarks. In 
other words, the robot tries to learn which landmarks can be suitably used in 
real navigation tasks by simulating the conditions the robot will encounter along 
the paths. 

At the end of the TBL process only those landmarks that are visible and 
whose reliability r; is above a given threshold are suitable to be used in navigation 
tasks. 

The reliability factor r; for landmark I is continuously computed during the 
TBL phase through the following: 



ri = 



^TBL i 

TBL 



( 4 ) 



where TBL is the total number of steps exploited till that time, and r\ is the 
reliability of landmark I calculated at time i. In the tests, at the end of the phase, 
TBL usually consists of 400 steps (an internal counter) and it takes about 13 
seconds to be performed. In figure [8] two pictures taken during a TBL phase 
exploited by the robot are shown. 
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Fig. 7. The arcs performed by the robot to implement the TBL phase 



3 Navigation from Landmarks 



After reliable landmarks have been chosen then navigation information can be 
extracted from them m- The underlying biological principle is that a real 
movement is represented by an attraction force. It is produced by taking into ac- 
count that the agent tries to restore the original position and size of the involved 
landmarks [^. In fact, during the navigation task we are continuously comput- 
ing the displacement vector (the difference between the actual position and the 
original one) and the actual size. Let di the difference between the original and 
the actual position of a landmark I and Wi a weight thus computed: 



Wi 






( 5 ) 



where M^y is the original landmark size. The weight Wi takes into account if a 
landmark has a bigger size than the original one; in this case Wi is negative but 
still \Wi\ > 1. The attraction given by a landmark I is: 



vi = di- Wi ( 6 ) 

The data can be fused together by weighting them by introducing a sigmoid 
function s(r/) ranging from 0 to 1. The overall navigation vector can be thus 
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Fig. 8. Two pictures taken from a TBL phase; the numbers associated with each land- 
mark represent r; in different times 
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calculated as: 

L 

■ s{n) 

V = [K y,] = (7) 

1=1 

where L is the number of landmarks chosen after TBL, r/ is the reliability value of 
landmark I and vi is the attraction force exertedhy landmark 1. Lastly, Vx and Vy 
represent an estimation of the distance (along x and y axes of the environment) 
from the actual position to the goal position. 

Figure 0 summarizes the situation where the picture represents a typical 
frame taken during a navigation test. In particular, the segment pictured in the 
circle at the bottom-center is proportional to the overall attraction exerted by 
the goal (see Eqn. ED- Above the circle the variance of that attraction is reported 
and under the circle the attraction vector is broken down into a module and an 
angle. In the circle on the right the single attraction exerted by each landmark 
is drawn. Each landmark has associated with it a number given by the value of 
the sigmoid function applied to its reliability measure. The arrows at the top- 
center of the figure represent the motion commands given to the robot. These 
translations are respectively the x and y component of the movement to be taken 
from this position to reach the goal. 

Vector V represents the next movement with the module and direction rela- 
tive to the actual robot position. The system dynamical model is therefore given 
by: 

{ x{k + l) = x{k) + Vx{x{k),y{k)) , . 

\y{k + l)=y{k) + Vy{x{k),y{k)) 

where x{k) and y{k) represent the coordinates of the robot at step fc; Vx{x{k), 
y{k)) and Vy{x{k),y{k)) the displacements computed at step k following Eqn.|7] 
These displacements are related to the position at step k given by {x{k),y{k)). 
Lastly, x{k+ 1) and y{k+ 1) represent the new positions the robot will move to. 
Clearly, an important equilibrium point {x*,y*) for the system is given by the 
coordinates of the goal position. 

The navigation vector computed in Eqn. |7]can be considered as the overall 
attraction exerted by the goal position from that place. 

4 The Guidance Principle 

The system introduced by Eqn. [S] needs to be analysed in some detail. Basi- 
cally, the dynamic system presented can be considered continuous-time with the 
following (omitting the vector notations): 

x{t) = V{x{t)) (9) 

where x represent the generic coordinates and an equilibrium point a;* is located 
at the goal position. 
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Fig. 9. A frame taken during a real navigation task. The segment pictured in the circle 
at the bottom-center is proportional to the overall attraction exercised by the goal (see 
Eqn. El). Above the circle the variance of that attraction is reported and under the 
circle the attraction vector is broken down into a module and an angle. In the circle 
on the right the single attraction exercised by each landmark is drawn. Each landmark 
(box-shaped) has associated with it a number given by the value of the sigmoid function 
applied to its reliability measure: this is necessary to perform data fusion. The arrows 
at the top-center of the figure represent the motion commands given to the robot 



Several important considerations for the stability of the system can be ex- 
pressed focusing attention on its properties. In particular, when a dynamic sys- 
tem can be represented by i = f{x) with a fixed point x*, and it is possible to 
find a Lyapunov function, i.e. a continously differentiable, real-valued function 
U{x) with the following properties [3D]: 

1. U{x) > 0 for all x ^ x* and U{x*) = 0 

2. U{x) < 0 for all x ^ x* (all trajectories flow downhill toward a;*) 

then X* is globally stable: for all initial conditions x{t) x* as t ^ oo. 

The system depicted in Eqn. |D]is of type x = f{x) but, unfortunately, there 
is no systematic way to construct Lyapunov functions |25) . 

A key point to solve the problem is represented by the navigation vector 
field the system produces | 5|14|ll |b|7j. An example of a navigation vector field 
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is represented by figure [10] The goal (represented by a small circle) seems to be 
located within a basin of attraction. 
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Fig. 10. An example of a navigation field: directions and modules (numbers) 



In practice, the whole environment is subdivided into cells and for each of 
them the robot computes the navigation vector as given by Eqn. |3 The entire 
set of vectors is the navigation vector field. 

Formally speaking, a vector field in two dimensions is a function that assigns 
to each point {x,y) of the xy-plane a two-dimensional vector V{x,y) usually 
represented by its two components: 

V{x,y) = [V^{x,y)Vy{x,y)\ (10) 

where Vx{x,y) is the x-component and Vy{x,y) is the y-component. 

Particularly important is the conservative vector field. This is defined as one 
of which the integral computed on a closed path is zero, i.e. the vector field 
V{x,y) is conservative if and only if: 



j)^V{x,y) 



o dr = 0 



( 11 ) 



for any closed path c contained in the field of V; dr is the infinitesimal direction 
of motion. Such a field can always be represented as the gradient of a scalar 
function defined by: 




V{X,Y)odr 



( 12 ) 
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where the path of integration is arbitrary. The scalar function U is referred to 
as the potential of the conservative force V in question. If the function U{x,y) 
is known then the vector field can be determined from the relation: 



V{x,y) = VU{x,y) 



( 13 ) 



All these considerations can be applied to our case. Assuming that the goal 
position is located at the minimum of the potential basin, Eqn. [0] is slightly 
modified with the following [am]: 



y = [R. = - 



dU{x,y) dU{x,y) 



dx 



dy 



( 14 ) 



U(x,y) = - ( V,{X,py)dX- f\y{x,Y)dY 

J V-y J Vn, 



If the field is conservative, the scalar product introduced in Eqn. II 21 can be 
further simplified by following a particular curve c m-- 

cy 

( 15 ) 

^Px 'JPy 

where U{x,y) is the potential function and the path of integration is along 
the horizontal line segment from the reference point {px,Py) to the vertical line 
through (x, y) and then along the vertical line segment to (x, y). Every point can 
be referred to in terms of potential with the goal position. The problem of finding 
a Lyapunov function for Eqn. [8| can thus be solved numerically. If the goal place 
meets the two main Lyapunov criteria (see above) then the trajectories followed 
by the robot converge towards it. 

The convergence to the goal is one of the main features related to the concept 
of robustness. Another key point is the repeatability of the experiments. Quali- 
tatively speaking, the system converges to the goal but does it follow the same 
path when it departs from the same starting point? The analysis of this case 
involves a different property of a vector field: its conservativeness. This follows 
from a theoretical results. In particular, a non-conservative field is one in which 
the circuitation is non-null: the value of U computed by Eqn. [T2I depends on 
the path followed. In other words, U is not determined entirely by the extreme 
points. Therefore the integration process can lead to an infinite set of results de- 
pending on the integrating path c. This means that diverse functions can drive 
the system dynamics thus producing different navigating paths. In addition, the 
convergence to the goal depends on the actual path chosen. 

Therefore, a practical application of Eqn.[TT]to state if the field is conservative 
needs to be found. To this extent, let us suppose that Vx and Vy are smooth and 
continuously differentiable and that the vector field is defined on a connected 
set. Under these hypotheses, a necessary and sufficient condition for the unique 
integration of the vector field is that the following relation {Cauchy- Riemann) 
holds: 

dVx{x,y) _ dVy{x,y) 



dy 

dVx 

dy 



dV^ 

dx 



dx 



= 0 



( 16 ) 



In other terms: 



( 17 ) 
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The first member of Eqn.[l7]can be a measure of the level of conservativeness 
of the vector field. 

In conclusion, if the visual potential function has a basin of attraction with 
the minimum located at the goal position then the theory states that the system 
is intrinsically stable, at least starting navigating from within the convergent 
region of the environment. 



5 Tests 

As previously explained, tests have been performed both to measure how TBL 
affects the conservativeness of the navigation field and to calculate the region of 
convergence for the overall system. 

The collection of the whole set of vectors (see Eqn. IIUI) is performed firstly 
placing the robot in a known position of the environment and then applying 
the method detailed in Eqn. |7] The iteration of the method over the whole 
environment and the collection of every displacement vector produces a vector 
field, as previously exemplified in figure [TO] Each cell is approximately as big as 
the base of the robot. 

Referring to Eqn.[T3the partial derivatives of 14 and Vy (respectively referred 
to as Vxy and Vyx) must be computed. Figure [12] plots the components 14 and 
Vy and their cross derivatives and taken by a real test. 

The conservativeness of the field computed with a threshold set to 0 and land- 
marks sized 6 is shown in figure [H] Only small regions of the whole area roughly 
satisfy the constraint. A small threshold for TBL can dramatically change the 
situation. In figure [T3I the amount of conservativeness for each point is plotted. 

A key consideration is concerned with the scale along z: it is about one 
order of magnitude less than the one reported in figure [TT] A trend toward a 
conservative field is thus becoming evident. 

The situation obtained with a threshold of TBL set to 0.2 has been reported 
in figure [21 A large area of the environment has a measure of conservativeness 
that roughly equals 0. 

Similar considerations can be expressed dealing with a different landmark 
size. For example, figure [Tsi shows the case where the TBL threshold is 0.2 and 
landmarks have a size of 4. The template of the graph is the same as before. 
Therefore, with a good choice of the threshold the field becomes conservative 
regardless of the size of the landmarks. 



5.1 Computation of the Visual Potential Field 

The computation of the visual potential field must be performed only on those 
areas of the environment which are conservative. From the results of the tests 
only two cases can be considered: when TBL threshold is set at 0.2 and landmarks 
have a size of 6 or 4. Starting with the former and following the method detailed 
in the previous section, the visual potential function is shown in figure Hn] 
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Fig. 11. Conservativeness of a vector field computed with a TBL threshold of 0 and 
landmarks sized 6 
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Fig. 13. Conservativeness of a vector field computed with a TBL threshold of 0.1 and 
landmarks sized 6 
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Fig. 14. Conservativeness of a vector held computed with a TBL threshold of 0.2 and 
landmarks sized 6 
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Fig. 15. Conservativeness of a vector field computed with a TBL threshold of 0.2 and 
landmarks sized 4 




X 



Fig. 16. Potential function computed with a TBL threshold set to 0.2 and landmarks 
sized 6 
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The goal position is located in (4, 5) and it represents the reference point 
for Eqn. The shape of the potential function tends to produce a minimum 
around the goal (see figure [TtI) . 



Partial view of U 




Fig. 17. Potential function computed with a TBL threshold set to 0.2 and landmarks 
sized 6 magnified 30 times around the goal 



In addition, roughly speaking, the basin of attraction of the goal is the whole 
environment (where the tests were performed), i.e. apart from some cases, all 
the starting points lead to the goal. To this extent, consider the differences in 
the potential field showed in figure [TOl where the TBL threshold is 0.2 but the 
landmark size is 4. 

There are two important differences: the first concerns the basin of attraction 
and the second is concerned with the depth of the minimum in the goal position. 
The basin of attraction determines how far a goal position can be felt. In other 
words, if the robot starts navigating within the basin of attraction then it reaches 
the goal position. Outside the basin the robot could lead to other (false) goals. 
The visual potential function reported in figure HE possesses a larger basin of 
attraction than the one reported in figure HHl which influences the robot only 
when the robot is close to the goal. 

Intriguingly enough, this has strong analogies with the biological results re- 
ported in m- The authors discussed the area of attraction of the goal (namely, 
the catchment area) considering the size of landmarks surrounding it. Large land- 
marks determine larger catchment areas than smaller landmarks. Furthermore, 
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Partial view of U 




Fig. 18. Potential function computed with a TBL threshold set to 0.2 and landmarks 
sized 4 magnified 30 times around the goal 




X 



Fig. 19. Potential function concerning the case with a TBL threshold set to 0.2 and 
landmarks sized 4 
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larger landmarks determine a coarse approach to the goal whereas smaller land- 
marks allow the insects to precisely pinpointing the goal position. 

To this extent, the visual potential function produced with landmarks sized 
4 has a deeper minimum at the goal than the potential obtained with a size of 
6 (see figure [TSl compared to[l7j. 

Lastly, a comparison between the following two figures can witness the amount 
of robustness gained by the system when learning involves big TBL thresholds. 
The first test conducted and reported in figure doesn’t consider any TBL 
selection phase. 




A O (;OAI. POSITION 



1020 cm 



Fig. 20. Experiments conducted without any TBL phase with landmarks sized 6 

The landmarks used were sized was 6 (that is mxy = 6) and there were 
14 of them. Every navigation trial has been numbered from 1 to 11 and every 
individual ending point has been reported with the notation Gi where i is the 
starting point of navigation i. Sometimes, some navigation trials has no end 
within the environment so its path has been labeled with an arrow at the point 
where it finishes. 

On the other hand, we also need to consider navigation trials accomplished 
after a TBL selection (threshold 0.10) as reported by figure ED In this situation, 
14 landmarks with a size of rrixy = 6 were originally selected. After a TBL phase 
that number was reduced to 8. Now the navigation paths are smoother compared 
to those previously reported. Nevertheless, there are some navigations (7, 3 and 
4) whose ending points do not lie in the right position. Those starting points (at 
least for trials number 3 and 4) can be easily trapped by other minima other 
than the main goal. 
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Fig. 21. Experiments conducted with a TBL phase and landmark sized 6 



5.2 Issues on the Visual Potential as a Lyapunov Function 

From the potential function previously plotted it can be easily understood why 
the system gets sometimes trapped into false goals or what can be the region of 
convergence for the main goal position. 

This implicitly states that the system has no overall stability on the whole 
environment. Therefore, the visual potential function itself cannot be considered 
Lyapunov compliant unless reducing its domain of application to a region around 
the goal position, starting from which the system converges (see e.g. figure [T7). 

6 Related Work 

To date, biologically-based methods have not been widely addressed in robotics. 
Very recently, however, in the survey paper by Trullier et al. m and in books by 
Srinivasan and Venkatesh m and Aloimonos |I] some interesting adaptations of 
methods directly inspired by animals and insects have emerged. This, however, 
represents more a niche development than a classical approach considering that 
even recent reviews that can be defined classical, e.g. mm , do not address 
any biologically-inspired methods. 

Basically, a mile-stone for visual guidance mechanisms is represented by [H], 
where a theoretical computational model to explain the Bee visual guidance 
behavior is introduced. Some interesting extensions to the model introduced in 
jS] are addressed in a recent work |54] . The main contribution concerns the use 
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of a pyramidal structure for landmark matching between current and snapshot 
images that the author considers as uni-dimensional and gray-level. 

An approach to grey-level bi-dimensional images is addressed by Hong et al. 
I16|l . The authors use an omnidirectional device (a hemisphere) for extracting 
the whole panorama. The robot computes a one-dimensional circular location 
signature (an array of 256 elements ranging from 0 to 255) by sampling the 
hemispherical image along a precise radius at angular intervals. Characteristic 
points of this signature are then found by investigating the brightness profile. A 
matching phase is then performed by comparing the signatures (the one taken at 
the goal and the one computed considering the actual image) by an affine model. 
The affine factors are estimated at a preliminary stage using the signatures 
extracted. 

The use of afhne models to compute the final navigation vector has been ad- 
dressed by Salas et al. [2H1- Here, the authors try to recognize and locate artificial 
landmarks scattered across a workspace using algebraic invariants for landmark 
recognition and matching. Those invariants do not change under general affine 
transformations. Once the shape has been recognized the inverse problem con- 
sists of finding the parameters of the geometrical transformation. The parameters 
are directly mapped to real robot displacements. 

Recently, a log-polar transformation for recognizing the most suitable land- 
marks in the environment has been considered by Gaussier et al. jl5] . The land- 
marks are learnt during a preliminary exploration phase of the robot around the 
goal. They are represented by a log-polar transformation of a quasi-panoramic 
image (see the work of Joulain et al. [18] for a detailed explanation of the methods 
used to extract salient landmarks from the environment). The learning module 
is represented by a neural network. Later during the navigation phase the sys- 
tem gives as an input to the neural network the actual (transformed) image and 
its level of activation gives enough informations on the recognition of previously 
learnt landmarks and their position values in the environment. 

In Bianco et al. [J an extension of the models presented in I9I16I34I28I that 
deals with full resolution images has been considered. In order to estimate the 
robot movement, a comparison is made between the actual image and the one 
representing the goal place. All the possible changes due to robot movements 
can be described by a simplified version of a classical affine model. 

A completely different approach has been addressed by Lambrinos et al. m- 
In this work, in particular, a new model called the Average Landmark Vector 
(ALV) is introduced: it explains the insect’s navigation capabilities without con- 
sidering any snapshot or, of course, any matching phase. The authors think that 
the only piece of information needed is the direction of the average landmark vec- 
tor acquired by a summation of unit vectors pointing to landmarks. The target 
direction is calculated in a specific position as is the difference between the AL 
vector and the one at the current position. The ALV model shows very robust 
behavior in complex realistic environments where several landmarks are present. 
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7 Concluding Remarks 

This paper has presented the learning system of a biologically-inspired navigation 
method based on natural visual landmarks. The visual learning phase (TBL) 
affects the conservativeness of the navigation vector field thus allowing us to 
explain landmark navigation in terms of a potential field. 

Conversely, the computation of the conservativeness of a navigation field can 
assess the reliability of the landmarks chosen and, therefore, measure the quality 
of the learning phase. 

Lastly, the presence of a potential function around the goal allows us to apply 
classical control theories to assess the robustness of the overall navigation sys- 
tem. The idea of applying methods from vector analysis to navigation problems 
allows us to evaluate the performance of different models and can represent an 
important step for topological navigations. 
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